classify: TEXT CLASSIFIER WITH LEARNING CAPABILITY

By Alex King

SUMMARY

This is a text classifier written in Python in a functional style. It relies on a stored body of "language training" documents to build its reference system. See "MODIFYING AND EXTENDING THE REFERENCE SYSTEM" below for information on how to add additional classification modes.

classify provides three example classification modes: language, subject and code.

--lang will invoke language mode, which recognizes input text in the following languages:

  • Czech
  • English
  • French
  • German
  • Hungarian
  • Italian
  • Polish
  • Russian
  • Spanish

--subject will invoke subject mode, which expects an academic paper (or similar) written in English, and will recognize the following subjects of study:

  • Computer Science
  • Biology
  • Psychology
  • Economics

--code will invoke code mode, which recognizes source code files of the following types:

  • C/C++
  • C#
  • COBOL
  • Java
  • Lisp Family
  • Objective-C
  • Python

USAGE

classify expects plain text as input, specified on the command line either as a file or as a web address.

To run language recognition, run with python classify.py --lang [input.txt|http://webpage.com].

To run academic subject recognition, run with --subject instead.

It is convenient to run pdftotext [paper.pdf] to extract plain text from an academic paper formatted as a PDF.
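
A typical two-step invocation might look like this (the filenames are placeholders):

    pdftotext paper.pdf paper.txt
    python classify.py --subject paper.txt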

Examples:

  • --lang samples/polish.txt will classify a Polish text as Polish.
  • --subject samples/econ6.txt will classify an Economics paper as Economics.
  • --lang http://lemonde.fr will classify a popular French newspaper as French.
  • --subject http://bigocheatsheet.com will classify the site as Computer Science!
  • --subject http://nytimes.com will classify the New York Times as Economics.

NEW IN VERSION 0.7.0: Upon classification, classify will ask the user whether the classification is correct. If it isn't, classify will ask permission to copy the input file (or HTML document) into the training library to strengthen future classifications. This is a powerful feature that allows for rapid improvement: as soon as a hole is found in the training models, it can be partially plugged with the provided document. In theory, after many runs of this program, the training library could become incredibly strong.
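
Conceptually, the training step is just a file copy into the right type directory of the training library (see MODIFYING AND EXTENDING THE REFERENCE SYSTEM below). A rough Python sketch, using hypothetical names rather than the exact code in classify.py:

    import os
    import shutil

    def train(input_path, mode, correct_type):
        """Copy a misclassified document into the training library so that
        future runs include it when building the model for correct_type."""
        dest_dir = os.path.join("mode-" + mode, correct_type)
        os.makedirs(dest_dir, exist_ok=True)
        shutil.copy(input_path, dest_dir)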

ALGORITHM IN DETAIL

classify is an example of simple machine learning that leads to an oddly powerful end result. The goal is to classify ASCII text into some type within a category. The provided categories are natural language, programming language, and academic subject. To do this, classify must have a method of creating a model for each type within a category, as well as a model for the input document. It then needs a method of comparing the predefined models to the input model to guess which type is correct.

To generate models, classify uses databases of "trigrams" instead of a database of words. This idea came from an assignment from Norman Ramsey's COMP 50 course at Tufts University in Fall 2013. Please note that the natural language training documents are also from that course.

A trigram is a string of three contiguous characters from input. For example, "Hello!" would yield the trigrams "Hel", "ell", "llo", and "lo!". Trigrams allow for granular recognition of roots, prefixes, suffixes, and generally discipline-specific terminology.
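
As a quick illustration (the function name here is hypothetical, not necessarily the one used in classify.py):

    def trigrams(text):
        """Return every run of three contiguous characters in text."""
        return [text[i:i + 3] for i in range(len(text) - 2)]

    # trigrams("Hello!") == ['Hel', 'ell', 'llo', 'lo!']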

Trigrams are counted and summed in an association list, in this case a Python dictionary. Though counting the number of occurrences of a trigram isn't strictly necessary for model comparison (versus just checking whether the trigram occurs at all), it is a useful step for other data-crunching purposes.
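
In Python this amounts to a dictionary (or collections.Counter) mapping each trigram to its count; a sketch, not the exact code from classify.py:

    from collections import Counter

    def build_model(text):
        """Map each trigram in text to the number of times it occurs."""
        return Counter(text[i:i + 3] for i in range(len(text) - 2))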

After models are generated, the input model is compared to each type's model within the specified category. For example, if a user runs classify --subject on a provided Economics document, classify will check the document against its models for Psychology, Biology, Computer Science, and Economics. Model similarity is derived from bit vector similarity -- essentially counting the number of trigrams shared by two models and dividing by the size of the model. A perfect score of 1.0 indicates two identical models, while a score of 0 indicates two completely unrelated models with no common trigrams. Depending on the training library and input, most comparisons yield scores between 0.2 and 0.6.
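
A sketch of that comparison, assuming models are dictionaries keyed by trigram (the exact scoring in classify.py may differ in details such as which model's size serves as the denominator):

    def similarity(input_model, reference_model):
        """Fraction of the input model's trigrams that also appear in the
        reference model: 1.0 means full overlap, 0.0 means no common trigrams."""
        if not input_model:
            return 0.0
        shared = sum(1 for trigram in input_model if trigram in reference_model)
        return shared / len(input_model)

    def best_type(input_model, reference_models):
        """Pick the type whose reference model scores highest against the input."""
        return max(reference_models,
                   key=lambda t: similarity(input_model, reference_models[t]))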

MODIFYING AND EXTENDING THE REFERENCE SYSTEM

classify's quality is limited only by its training corpus. A larger variety of documents will lead to more accurate classifications. A mode can be added or changed very easily by adding directories and files in the following format:

  • A directory named "mode-<mode>" (for example, mode-lang) in the same directory as classify.py
  • Within the mode directory, subdirectories named after the different types within the mode
  • Within each type directory, plain text files representative of that type

See folders "mode-lang" and "mode-subject" for examples.

classify will automatically recognize the new mode, and it will be usable from the command line with the expected --<mode> switch.
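
A minimal sketch of how that discovery can work (the actual implementation in classify.py may differ):

    import os

    def available_modes(root="."):
        """Find modes by looking for directories named mode-<mode> next to classify.py."""
        return sorted(d[len("mode-"):] for d in os.listdir(root)
                      if d.startswith("mode-") and os.path.isdir(os.path.join(root, d)))

    # With mode-lang and mode-subject present, this returns ['lang', 'subject'],
    # so --lang and --subject become valid command-line switches.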

Also note that the reference system can be bolstered one document at a time by "training" it upon incorrect classification (see USAGE above).

KNOWN ISSUES AND PLANNED IMPROVEMENTS

As of Version 0.7.0, classify has evolved into more of a proof of concept of simple machine learning than a tool focused on practical utility. Language recognition is generally very accurate due to the great training library provided, but subject and code recognition exist mostly as proofs of concept now. There are no immediate plans to bolster their training libraries, because it can now be done easily through the program itself.

The next step is to add handling of new types within a mode. For instance, if a user attempted to classify an Astronomy document, they could tell classify that it was not one of the preexisting types but rather an entirely new one. This would make it very fast to categorize a lot of different documents, particularly websites.

Other issues (pre-0.7.0):

classify currently has no way of knowing if the supplied text is in English for the --subject option, so it may silently give wrong answers.

A simple GUI with rich file selection would be a great next step for this software. It could also be extended easily to work on programming languages.

More academic subjects may be added soon. There is no obvious way to find a random sampling of various material, so this work is tedious. It exists as an option right now as a proof of concept more than anything else. Same goes for the code mode.

VERSION HISTORY AND RELEASE NOTES

8/2/15 VERSION 0.8.0

  • Added --audio as another, very experimental option. Works with both artists and genres as 'types'. Hand-picked reference material from personal collection.
  • Added fingerprint.py, a program for turning music files into fingerprints: text files that classify.py uses for audio mode. See fingerprint/readme.md for more details.

5/29/15 VERSION 0.7.5

  • Added stripping of HTML tags to better classify websites
  • Fixed bug where certain web addresses wouldn't translate to file paths properly

12/28/14 VERSION 0.7.0

  • Added ability to "train" the classifier upon incorrect classification by copying input file (or website) to the training library. This is a big next step that turns classify into more of an example of machine learning.

12/22/14 VERSION 0.6.5

  • Added automatic recognition of modes to make it easy to extend classify's usability. Usage switches will be automatically recognized.
  • Added --code as another experimental option. The reference material, taken from rosettacode.org, is currently very limited.

12/20/14 VERSION 0.6.0

  • Added language recognition of web pages. Use a web address beginning with http://.

12/20/14 VERSION 0.5.0

  • Rewritten in Python: a teaching experiment for the author, and an exercise in recognizing what Python is so good at. Python has so many convenient looping and mapping methods built in that it is far easier to manage directories of training documents. The original Racket version was several hundred lines of code, the C version was around 250, and the Python version is around 100.
  • Returned support for natural language recognition! Run with the --lang option. This was far easier to implement after porting to Python.

11/15/14 VERSION 0.2.0

  • Support for Economics
  • Initial git commit

11/14/14 VERSION 0.1.0

  • Initial release
  • Support for three subjects: Computer Science, Psychology, Biology
