I have several low-quality PDFs. I would like to use OCR (more precisely, Ocropus) to get text from them. To do so, I first use ImageMagick, a command-line tool that converts PDFs to images, to transform these PDFs into JPG or PNG.
However, ImageMagick produces very low-quality images, and Ocropus hardly recognizes anything. I would like to learn which parameters are best for handling low-quality PDFs, so that the OCR gets images of the best possible quality.
I have found this page, but I do not know where to start.
Best Answer
You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing:
(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:
Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:
You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:
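The delegate queries described above can be sketched for Linux roughly as follows. This assumes ImageMagick 6's "convert" front end is on the PATH (ImageMagick 7 prefers "magick"), and the helper name list_delegates is only an illustration, not something from the answer.

```shell
#!/bin/sh
# Sketch: inspect ImageMagick's delegate configuration.
# Assumption: ImageMagick's "convert" is installed and on PATH.

# Hypothetical helper that prints every delegate command IM knows about.
list_delegates() {
    convert -list delegate
}

if command -v convert >/dev/null 2>&1; then
    # Full list of delegates (external helper commands).
    list_delegates

    # Only the delegate lines that mention PNG (case-insensitive).
    list_delegates | grep -i png || echo "no delegate line mentions PNG"

    # Only the delegate lines that mention PDF (case-insensitive).
    list_delegates | grep -i pdf || echo "no delegate line mentions PDF"
else
    echo "ImageMagick (convert) not found on PATH" >&2
fi
```

The exact lines printed depend on the installed ImageMagick version and its delegates configuration, so the output on your machine may differ from what the answer describes.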
Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
About IM's handling of PDF conversion to images via the Ghostscript delegate, you should know two things first and foremost:
1. The resolution can be set with "-density 600", which tells Ghostscript to use a 600 dpi resolution for its image output.
2. The detour of "PDF => PS" and then "PS => PNG" is a real blunder, because you never gain and hardly ever keep quality in the first step, but very often lose some: the intermediate PostScript may no longer carry the PDF's TrueType font and transparency information. (The opposite direction, "PS => PDF", is not that critical.)

That's why I'd suggest you convert your PDFs to PNG (or JPEG) in one go, using Ghostscript directly. And use the most recent version of Ghostscript, 8.71 (soon to be released: 9.01)! Here are example commands:
(This is the command line for Windows. On Linux, use "gs" instead of "gswin32c.exe", and "\" instead of "^".) This command expects to find an "output" subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try this (Linux command version):

This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
[*] D'oh! I missed your "linux" tag at first...