Android – Custom Dictionary for Tesseract

android dictionary ocr tesseract

I am currently working on an Android project that uses Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to the Tesseract OCR wiki, the best way to go about this would be to

Replace tessdata/eng.user-words with your own word list, in the same
format – UTF8 text, one word per line.

However, there is no eng.user-words file in the tessdata folder. I assume that if I just create a text file with my dictionary in it, it will never be used…

Has anybody had a similar experience and knows what to do?

Best Answer

If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file.

In my case I wanted to replace the word-dawg file completely to try to get better results, since the words I'm detecting are always the same.

You'll need the combine_tessdata and wordlist2dawg executables, which end up in the training directory when you compile Tesseract.

Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later):

./combine_tessdata -u eng.traineddata

Create a text file containing your word list (wordlistfile), then create an eng.word-dawg from it:

./wordlist2dawg wordlistfile eng.word-dawg traineddat_backup/.unicharset

Replace the word-dawg file inside eng.traineddata:

./combine_tessdata -o eng.traineddata eng.word-dawg

That should be it.
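If you'd rather script the steps above, here is a minimal sketch in Python. It assumes both executables sit in the current directory and reuses the file names and the unicharset path exactly as they appear in the answer. The helper names (normalize_words, write_wordlist, build_commands, run_commands) are made up for illustration, and nothing is executed until you call run_commands yourself.

```python
import subprocess
from pathlib import Path


def normalize_words(words):
    """Strip whitespace, drop blanks and duplicates, keep first-seen order."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    return unique


def write_wordlist(words, path="wordlistfile"):
    """Write the word list as UTF-8 text, one word per line."""
    unique = normalize_words(words)
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return unique


def build_commands(wordlist="wordlistfile",
                   traineddata="eng.traineddata",
                   dawg="eng.word-dawg",
                   unicharset="traineddat_backup/.unicharset"):
    """Return the three command lines from the answer, in order."""
    return [
        ["./combine_tessdata", "-u", traineddata],          # unpack components
        ["./wordlist2dawg", wordlist, dawg, unicharset],    # build the new dawg
        ["./combine_tessdata", "-o", traineddata, dawg],    # put it back
    ]


def run_commands(commands):
    """Run each command in turn; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution lets you print and inspect the command lines before touching your eng.traineddata.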

Python 3.7+ or CPython 3.6

Dicts preserve insertion order in Python 3.7+. Same in CPython 3.6, but it's an implementation detail.

>>> x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
>>> {k: v for k, v in sorted(x.items(), key=lambda item: item[1])}
{0: 0, 2: 1, 1: 2, 4: 3, 3: 4}

>>> dict(sorted(x.items(), key=lambda item: item[1]))
{0: 0, 2: 1, 1: 2, 4: 3, 3: 4}
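
If you need this in several places, the same pattern can be wrapped in a small helper. This is only a sketch and assumes Python 3.7+; the name sort_by_value is invented for the example, while reverse=True is the standard sorted() argument for descending order.

```python
def sort_by_value(d, reverse=False):
    """Return a new dict ordered by value; the input dict is left untouched."""
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=reverse))


x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
ascending = sort_by_value(x)
descending = sort_by_value(x, reverse=True)

print(list(ascending.items()))   # [(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
print(list(descending.items()))  # [(3, 4), (4, 3), (1, 2), (2, 1), (0, 0)]
print(list(x))                   # [1, 3, 4, 2, 0], original order unchanged
```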

Older Python

It is not possible to sort a dictionary, only to get a representation of a dictionary that is sorted. Dictionaries are inherently orderless, but other types, such as lists and tuples, are not. So you need an ordered data type to represent sorted values, which will be a list—probably a list of tuples.

For instance,

import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(1))

sorted_x will be a list of tuples sorted by the second element in each tuple. dict(sorted_x) == x.
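
That last equality holds because comparing two dicts ignores order, whereas the list of tuples keeps it. A short check, reusing the names from the example above:

```python
import operator

x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(1))

print(sorted_x)                      # [(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
print(dict(sorted_x) == x)           # True: dict equality ignores order
print(sorted_x == list(x.items()))   # False: lists compare element by element
```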

And for those wishing to sort on keys instead of values:

import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(0))

In Python 3, since tuple unpacking in lambda parameters is not allowed, we can use

x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=lambda kv: kv[1])

If you want the output as a dict, you can use collections.OrderedDict:

import collections

sorted_dict = collections.OrderedDict(sorted_x)
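
One practical difference worth knowing: two OrderedDict instances compare equal only if their items are in the same order, while comparing an OrderedDict with a plain dict ignores order. A small illustration with the same data (the names by_value and by_key are just for this example):

```python
import collections

x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
by_value = collections.OrderedDict(sorted(x.items(), key=lambda kv: kv[1]))
by_key = collections.OrderedDict(sorted(x.items(), key=lambda kv: kv[0]))

print(list(by_value))      # [0, 2, 1, 4, 3]
print(list(by_key))        # [0, 1, 2, 3, 4]
print(by_value == by_key)  # False: same items, different order
print(by_value == x)       # True: comparison with a plain dict ignores order
```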

Related Solutions

Python – How to sort a list of dictionaries by a value of the dictionary

Python – How to sort a dictionary by value
