expgram

expgram is an ngram toolkit which can efficiently handle large ngram data:

  • A succinct data structure that compactly represents ngram data [1]. Among the ngram compression methods discussed in [1], we do not implement block-wise compression (zlib compression of every 8K-byte block) for reasons of computational efficiency.
  • The language model is estimated by MapReduce, following [2], using pthreads and/or MPI.
  • Better rest cost estimation for chart-based decoding in machine translation, which estimates lower-order ngram language model parameters [3].
  • A transducer-like interface motivated by [4], and efficient prefix/suffix ngram context computation [3].

Note that this toolkit is primarily developed to handle large ngram count data, which is why it is not named like the usual xxxlm toolkits.

The expgram toolkit is mainly developed by Taro Watanabe at Multilingual Translation Laboratory, Universal Communication Institute, National Institute of Information and Communications Technology (NICT). If you have any questions about expgram, you can send them to taro.watanabe at nict dot go dot jp.

Quick Start

The stable version is 0.2.1. The latest code is also available from github.com.
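
To get the latest code, clone the repository:

git clone https://github.com/tarowatanabe/expgram.git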

Compile

For details, see BUILD.rst.

./autogen.sh (required when you get the code by git clone)
./configure
make
make install (optional)
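
For example, a build that installs under your home directory looks like this (the --prefix value is only illustrative; any writable path works):

./autogen.sh
./configure --prefix=$HOME/local
make
make install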

Run

Basically, you only have to use expgram.py (found at <build dir>/scripts or <install prefix>/bin), which encapsulates all the steps of LM estimation. For instance, you can run:

expgram.py \
     --corpus <corpus> or --corpus-list <list of corpus> \
     --output <prefix of lm name> \
     --order  <order of ngram lm> \
     --temporary-dir <temporary disk space>

Here, we assume either a corpus, a newline-delimited set of sentences, given by --corpus <corpus>, or a corpus list, a newline-delimited list of corpus files, given by --corpus-list <list of corpus>; illustrative examples follow the output listing below. This will produce six outputs:

<prefix>.counts       extracted ngram counts
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (don't use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
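
For reference, a corpus file contains one sentence per line, and a corpus list contains one corpus file name per line. The file names here are only illustrative:

corpus.txt:
  this is the first sentence
  this is the second sentence

corpus-list.txt:
  /data/corpus-part1.txt
  /data/corpus-part2.txt

A concrete invocation for a 5-gram LM might then be:

expgram.py \
     --corpus corpus.txt \
     --output ngram.5 \
     --order 5 \
     --temporary-dir /var/tmp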

Alternatively, if you already have count data organized in the Google ngram format, simply run

expgram.py \
     --counts <counts in Google format> \
     --output <prefix of lm name> \
     --order  <order of ngram lm> \
     --temporary-dir <temporary disk space>

This will produce five outputs:

<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (don't use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
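
The counts directory passed to --counts is a directory with one subdirectory per ngram order; the sketch below follows the Google Web 1T layout (exact file names may vary):

counts/
  1gms/vocab.gz        # unigrams: word <tab> count
  2gms/2gm-0000.gz     # bigrams:  w1 w2 <tab> count
  3gms/3gm-0000.gz     # trigrams: w1 w2 w3 <tab> count
  ...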

To see the indexed counts, use (found at <build dir>/progs or <install prefix>/bin):

expgram_counts_dump --ngram <prefix>.index

which writes the indexed counts as plain text. Note that the language model probabilities are stored internally as natural logarithms (base e), not as base-10 logarithms. If you want to see the LM, use:

expgram_dump --ngram <prefix>.lm (or <prefix>.lm.quantize)

which writes the LM in ARPA format using the common (base-10) logarithm.
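
For reference, the ARPA format lists, for each order, a base-10 log probability, the ngram, and an optional base-10 backoff weight; the numbers below are made up for illustration. Since internal storage uses natural logarithms, a stored value of -2.3026 corresponds to -1.0000 here (division by ln 10, approximately 2.3026):

\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000 <unk>
-0.5229 hello  -0.3010
-0.9542 world  -0.3010

\2-grams:
-0.3010 hello world
-0.6990 world hello

\end\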

expgram_perplexity --ngram <prefix>.lm (or <prefix>.lm.quantize) < [text-file]

computes the perplexity on the text-file.
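
Here perplexity is the usual exp(-(1/N) * sum_i ln p(w_i | history)) over the N scored events of text-file (typically including end-of-sentence events); lower is better. A concrete call, with illustrative file names:

expgram_perplexity --ngram ngram.5.lm < test.txt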

Systems

It has been successfully compiled for x86_64 on Linux, OS X, and Cygwin, and is regularly tested on Linux and OS X.

References


  1. Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. A succinct n-gram language model. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 341-344, Suntec, Singapore, August 2009. Association for Computational Linguistics.

  2. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

  3. Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language model rest costs and space-efficient storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1169-1178, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

  4. Jeffrey Sorensen and Cyril Allauzen. Unary data structures for language models. In Interspeech 2011, pages 1425-1428, 2011.


License

The repository ships with multiple license files: LICENSE, COPYING.GPL (GPL-3.0), and COPYING.LGPL (LGPL-3.0).
