Authors

Jeanselme Vincent
Clin Matthieu

Project

This project has been undertook from May 24th to June 10th 2016. We want to recognize text from a scanned image. However, we do not use Neural Network in order to understand every issue and complexity of the problem.

Pre Processing

This section presents the different techniques used in order to clean the image and correct its rotation.

Filters

Several filters exist, however few have good results on test images where characters are considered as noise. It is why the bilateral and the median filters that are implemented are not used in the current version due to poor results. Linears filters present more interesting results. The software uses a linear filter with a mask which corresponds to a gaussian correction. This way, it reduces the noise.

Binarization

A global Otsu binarization is used in order to remove all the softened noise. However this method could be more accurate by a local binarization.

Rotation

First, the projection in the Hough space is computed. Thus, we implemented an heuristic in order to find the rotation needed in order to have the highest number of straight lines parallel to the bottom of the page.

Further work

Enhance the text by increasing the contrast.
Find other filters in order to delete all the remaining noise.
Cut the frame of the text.
Correct other distortions.

Segmentation algorithms

This section will present various segmentation algorithms implemented in this project. Algorithms are presented in their order of implementation (which also means complexity in this case).

Sliding window

The sliding window algorithm is the easiest way to dissociate lines and characters but it is also the most limited one. The principle is to look for a row (or column) where all pixels are white (meaning that there is a space between two lines of text or two characters).

###The pros :

Easy implementation
Able to spot paragraphs and words (bigger space)

###The cons :

Noise will be considered as line
Low DPI that leads to missing pixels (cutting a character in two) will be considered as two distinct characters.
A noise pixel which links two characters will not be cut by this method

Overlapping segmentation

The overlapping segmentation algorithm comes from the publication : Separation of overlapping character N. Papamarkos and T. Koutalianos Department of Electrical and Computer Engineering Democritus University of Thrace, 67100 Xanthi, Greece.. This algorithm shows good result when two characters are overlapping. Overlapping here means that the second character starts before the first one is finished (looking from a vertical line point of view).

###The pros :

When there is at least one pixel between two characters, it can separate them
Keeps accents on letter (and dot on 'i' and 'j' for example)

###The cons :

Characters have to be 'full' (no missing pixels) overwise the algorithm will separate them
It cannot separated two merged characters.

Oversegmentation

The oversegmentation algorithm is a dynamic algorithm. We first try to detect where the cuts are more likely to appear and then we try all the outcomes. This algorithm is bound to the recognition part as it needs it to weight(ponderate ?) the cuts. That's why the results of this algorithm is strongly dependent on the quality of the recognition algorithm and make it tricky to use.

Vocabulary

There are three kinds of segmentation :

Undersegmentation : The segmentation performed can still group more than one character in a box
Segmentation : The segmentation is always correct. One cut means one character
Oversegmentation : One character can be cut several times

###The pros :

Can segment touching characters (->meaning ?)
As seen in the litterature, gives really good results when the recognition part is handled by a neural network

###The cons :

Due to oversegmentation, it can easily find two words where there was only one ('m' becomes 'rn')
The compute cost grow exponentially if there is a lot of possible cuts
Some heuristics have to be made and we haven't been able to found publications on it for now as there all using a neural network to learn the best way to cut..

Character Recognition

This section presents the recognition process.

Clustering - KMeans

Due to the large number of images of each category, it is necessary to compute K clusters for each label. For each letter, the algorithm gathers closest images in an average image. To compute the different clusters, the KMeans algorithm is used.

Recognition - KNN

After computing the clusters, we compare the image obtained by segmentation to the different clusters. The K identic closest pictures define the label of the character. The MSE distance is used, even if Chamfer distance has been tried with and without skeletonization. (-> I didn't understand, but 'has been tried' seems hazardous)

Further work

Neural Network is the solution for recognition

Post Processing

Some imperfections remain after the recognition, it is why it is necessary to correct the obtained text.

HMM

In order to correct incoherences in the text, one needs to know the language of the text, it is why this software could be used only on english text. The hidden states are the real letters and the observed states are the result of the recogntion. The Viterbi algorithm computes the most probable word respecting the transition between letters in english.

Levenshtein

Even after the HMM, there are some mistakes. The algorithm computes the Levenshtein distance between the result and the dictionnary and replaces the word by the closest word. In order to have this correction, add -DLEVENSHTEIN in the Makefile. (the README is not a garbage can >.<)

Further work

Change the Levenshtein distance in order to allow more insertion of letters.
Take account of the error of over-segmentation in the HMM by creating new states.

DataSet

The dataset used in this project can be found on The Chars74K dataset page. However we add some special characters as '.', ',', ''', '\', '/' ... In order to insert new characters, you have to create a new directory in which you include new images, then you have to change the bijections between the directory's name and the character label.

Libraries

Compilation options

DSKELETONIZATION : Computes the skeltonization on each charcters.
DLEVENSHTEIN : Computes at the end of the HMM analysis the most probable word.
DCHAMFER : Computes the distance of Chamfer on each characters.
DKNN : Computes the KNN algorithm for recognition.

How to install and run OCR ?

Unzip the dataset at the root of the project.
Change the options of compilation in the Makefile.
And type : "make" at the root.

WARNING :

At the first running, there is a long learning phase, which could go on 3 hours.
If you change the compilation options, do "make clean; make" and delete the result_dataset and the hmm_save file.

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
Images_Tests		Images_Tests
exec_test		exec_test
header		header
obj		obj
src		src
tests		tests
.gitignore		.gitignore
Doxyfile		Doxyfile
Makefile		Makefile
README.md		README.md
dataset.tar.gz		dataset.tar.gz
dictionnary.txt		dictionnary.txt

specialityProjectEnsimag/OCR

Folders and files

Latest commit

History

Repository files navigation

Authors

Project

Pre Processing

Filters

Binarization

Rotation

Further work

Segmentation algorithms

Sliding window

Overlapping segmentation

Oversegmentation

Vocabulary

Character Recognition

Clustering - KMeans

Recognition - KNN

Further work

Post Processing

HMM

Levenshtein

Further work

DataSet

Libraries

Compilation options

How to install and run OCR ?

About

Resources

Stars

Watchers

Forks

Languages