expgram

expgram is an ngram toolkit which can efficiently handle large ngram data:

  • A succinct data structure that compactly represents ngram data [1]. Among the ngram compression methods discussed in [1], we do not implement block-wise compression (zlib compression of every 8K-byte block) for reasons of computational efficiency.
  • The language model is estimated by MapReduce, following [2], using pthreads and/or MPI.
  • Better rest cost estimation for chart-based decoding in machine translation, which estimates lower-order ngram language model parameters [3].
  • A transducer-like interface motivated by [4], and efficient prefix/suffix ngram context computation [3].

Note that this toolkit is primarily developed to handle large ngram count data, which is why it is not named like the usual xxxlm toolkits.

The expgram toolkit is mainly developed by Taro Watanabe at Multilingual Translation Laboratory, Universal Communication Institute, National Institute of Information and Communications Technology (NICT). If you have any questions about expgram, you can send them to taro.watanabe at nict dot go dot jp.

Quick Start

The stable version is 0.2.1. The latest code is also available from github.com.
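
To get the latest code, clone the repository:

git clone https://github.com/tarowatanabe/expgram.git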

Compile

For details, see BUILD.rst.

./autogen.sh (required when you get the code by git clone)
./configure
make
make install (optional)
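
For example, a build that installs under your home directory looks like this (the --prefix value is only illustrative; any writable path works):

./autogen.sh
./configure --prefix=$HOME/local
make
make install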

Run

Basically, you only have to use expgram.py (found at <build dir>/scripts or <install prefix>/bin), which encapsulates all the steps of LM estimation. For instance, you can run:

expgram.py \
     --corpus <corpus> or --corpus-list <list of corpus> \
     --output <prefix of lm name> \
     --order  <order of ngram lm> \
     --temporary-dir <temporary disk space>

Here, we assume either a corpus, a newline-delimited set of sentences, given by --corpus <corpus>, or a corpus list, a newline-delimited list of corpus files, given by --corpus-list <list of corpus>; illustrative examples follow the output listing below. This will produce six outputs:

<prefix>.counts       extracted ngram counts
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (don't use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
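
For reference, a corpus file contains one sentence per line, and a corpus list contains one corpus file name per line. The file names here are only illustrative:

corpus.txt:
  this is the first sentence
  this is the second sentence

corpus-list.txt:
  /data/corpus-part1.txt
  /data/corpus-part2.txt

A concrete invocation for a 5-gram LM might then be:

expgram.py \
     --corpus corpus.txt \
     --output ngram.5 \
     --order 5 \
     --temporary-dir /var/tmp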

Alternatively, if you already have count data organized in the Google ngram format, simply run

expgram.py \
     --counts <counts in Google format> \
     --output <prefix of lm name> \
     --order  <order of ngram lm> \
     --temporary-dir <temporary disk space>

This will produce five outputs:

<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (don't use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
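
The counts directory passed to --counts is a directory with one subdirectory per ngram order; the sketch below follows the Google Web 1T layout (exact file names may vary):

counts/
  1gms/vocab.gz        # unigrams: word <tab> count
  2gms/2gm-0000.gz     # bigrams:  w1 w2 <tab> count
  3gms/3gm-0000.gz     # trigrams: w1 w2 w3 <tab> count
  ...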

To see the indexed counts, use (found at <build dir>/progs or <install prefix>/bin):

expgram_counts_dump --ngram <prefix>.index

which writes the indexed counts as plain text. Note that the language model probabilities are stored internally as natural logarithms (base e), not as base-10 logarithms. If you want to see the LM, use:

expgram_dump --ngram <prefix>.lm (or <prefix>.lm.quantize)

which writes the LM in ARPA format using the common (base-10) logarithm.
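
For reference, the ARPA format lists, for each order, a base-10 log probability, the ngram, and an optional base-10 backoff weight; the numbers below are made up for illustration. Since internal storage uses natural logarithms, a stored value of -2.3026 corresponds to -1.0000 here (division by ln 10, approximately 2.3026):

\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000 <unk>
-0.5229 hello  -0.3010
-0.9542 world  -0.3010

\2-grams:
-0.3010 hello world
-0.6990 world hello

\end\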

expgram_perplexity --ngram <prefix>.lm (or <prefix>.lm.quantize) < [text-file]

computes the perplexity on the text-file.
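
Here perplexity is the usual exp(-(1/N) * sum_i ln p(w_i | history)) over the N scored events of text-file (typically including end-of-sentence events); lower is better. A concrete call, with illustrative file names:

expgram_perplexity --ngram ngram.5.lm < test.txt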

Systems

It has been successfully compiled for x86_64 on Linux, OS X, and Cygwin, and is regularly tested on Linux and OS X.

References


  1. Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. A succinct n-gram language model. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 341-344, Suntec, Singapore, August 2009. Association for Computational Linguistics.

  2. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

  3. Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language model rest costs and space-efficient storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1169-1178, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

  4. Jeffrey Sorensen and Cyril Allauzen. Unary data structures for language models. In Interspeech 2011, pages 1425-1428, 2011.


License

The repository ships with multiple license files: LICENSE, COPYING.GPL (GPL-3.0), and COPYING.LGPL (LGPL-3.0).
