C++ program calculating the entropy of a language based on a corpus of texts.
The usage template is as follows: ./entropy [command] [store_filename] [additional options]
More precisely, there are three commands available:
assimilate
takes a text file as input and computes the conditional probability of appearance of each character/word(*), assuming the text follows an nth-order stationary Markov model. All data is then stored in a file. The exact usage is: ./entropy assimilate [store_filename] [input_filename] [markov_order]
calculate
computes the entropy of the Markov model (see explanation below for more details). Usage: ./entropy calculate [store_filename] [markov_order]
generate
is undoubtedly the fanciest feature. It randomly generates text based on the probabilities computed with the assimilate
command. Usage: ./entropy generate [store_filename] [output_filename] [text_length] [markov_order]
We assume the sequence of characters follows a stationary Markov model of order k. This means that, for all n,

P(X_n = x_n | X_{n-1} = x_{n-1}, ..., X_1 = x_1) = P(X_n = x_n | X_{n-1} = x_{n-1}, ..., X_{n-k} = x_{n-k}).

More precisely, this probability equals p(x_n | x_{n-1}, ..., x_{n-k}), with p a function independent of n (the Markov chain is assumed stationary).
I added a few Python scripts to show how the code can be used to calculate the entropy of, say, French.
-
First you should download the raw material using:
$ python Scripts/french_texts_dl.py
$ python Scripts/code_civil_dl.py
(You may want to run the second script inside the Text folder created by the first one, so that all the files end up in the same place.)
-
Then launch the program on all files (don't forget to adjust the Markov order in the Python file):
$ python Scripts/assimilate_folder.py
Of course you can change any parameter you want in the python files ;)