Tesseract confuses two numbers

I'm writing an application to scan numbers from an image.

The numbers are printed in the OCR-B font and may also contain + and > characters.

This is my source image:

source image

The results from Tesseract weren't very good, even after limiting the character set to the digits plus + and >. As I couldn't find any OCR-B training files for Tesseract, I decided to train it myself.
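For reference, limiting the character set can be sketched roughly like this. It is only an illustration: it assumes a Tesseract build that accepts the `-c` option together with the `tessedit_char_whitelist` variable, and the helper name `build_tesseract_cmd` is made up for this example.

```python
# Characters the scan is expected to contain: digits plus '+' and '>'.
ALLOWED_CHARS = "0123456789+>"


def build_tesseract_cmd(image, output_base, lang):
    """Build a tesseract command line restricted to ALLOWED_CHARS.

    Hypothetical helper; assumes the installed tesseract supports
    `-c tessedit_char_whitelist=...`.
    """
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "-c", "tessedit_char_whitelist=" + ALLOWED_CHARS,
    ]


if __name__ == "__main__":
    cmd = build_tesseract_cmd("esr2c.tif", "ocrb-esr2c", "ocrb")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids having to quote the `>` character, which a shell would otherwise treat as a redirection.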

I created this training image and made a box file from it. The box file is correct: all characters are matched correctly.
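A box file like this can be sanity-checked with a small script. The sketch below assumes the usual Tesseract 3 box format of one glyph per line, `<glyph> <left> <bottom> <right> <top> <page>`; the sample coordinates are invented for illustration and are not taken from my actual box file.

```python
from collections import Counter


def parse_box_lines(text):
    """Parse box-file text into (glyph, left, bottom, right, top, page) tuples."""
    boxes = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so the glyph is whatever precedes the 5 numbers.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            raise ValueError("malformed box line: %r" % line)
        glyph = parts[0]
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append((glyph, left, bottom, right, top, page))
    return boxes


def glyph_counts(boxes):
    """Count how many boxes are labelled with each glyph."""
    return Counter(box[0] for box in boxes)


# Made-up sample data, only to show the expected shape of the file.
SAMPLE = """0 10 20 30 60 0
1 35 20 50 60 0
8 55 20 75 60 0
"""

if __name__ == "__main__":
    boxes = parse_box_lines(SAMPLE)
    for glyph, count in sorted(glyph_counts(boxes).items()):
        print(glyph, count)
```

Comparing the per-glyph counts against the training image is a quick way to confirm that every 1 and every 8 carries the label it should.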

Then I did all steps described here to create the other necessary files.

Using this newly trained OCR-B tessdata set, I get pretty good results on the source image, with one little bug: all 1s are mistaken for 8s and vice versa. The command used to process the image was

$ tesseract esr2c.tif ocrb-esr2c -l ocrb

and the output for the source image was

0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20

If you swap all 1s and 8s, the output matches the source image (except for the last two characters, which I can ignore).
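To make the swap concrete, here is a small sketch that exchanges every 1 and 8 in the output shown above. It only demonstrates the symptom; it is a workaround at best, not a fix for the training data. The function name `swap_1_and_8` is just for this example.

```python
# Translation table that maps '1' -> '8' and '8' -> '1', leaving
# every other character untouched.
SWAP_TABLE = str.maketrans("18", "81")


def swap_1_and_8(text):
    """Return text with all 1s and 8s exchanged."""
    return text.translate(SWAP_TABLE)


ocr_output = "0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20"

if __name__ == "__main__":
    print(swap_1_and_8(ocr_output))
    # prints: 0100000008456>1 00000895738201 1 08050001 023+ 01 0308226>20
```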

How could this happen? Did I make a mistake in the training process? How can I fix it?
