Linux – pdftotext not outputting Hebrew characters

encoding linux pdf

I'm using Xpdf's pdftotext to extract the text from some Hebrew PDF files on Ubuntu.

On my local machine this works fine. When I tried it on another machine, the Hebrew characters did not show up in the text file. I verified that the Hebrew language package is installed (the output below is why I think so). Where else can I look for the problem?

>> tail -2 /etc/xpdf/xpdfrc
include /etc/xpdf/includes

>> cat /etc/xpdf/includes
# This file was automatically generated by /usr/sbin/update-xpdfrc.
# Instead, add or remove files in /etc/xpdf/ then run
# /usr/sbin/update-xpdfrc to regenerate this file.
include /etc/xpdf/xpdfrc-latin2
include /etc/xpdf/xpdfrc-thai
include /etc/xpdf/xpdfrc-greek
include /etc/xpdf/xpdfrc-turkish
include /etc/xpdf/xpdfrc-arabic
include /etc/xpdf/xpdfrc-hebrew
include /etc/xpdf/xpdfrc-cyrillic

>> cat /etc/xpdf/xpdfrc-hebrew
#----- begin Hebrew support package (2003-feb-16)
unicodeMap  ISO-8859-8  /usr/share/xpdf/hebrew/ISO-8859-8.unicodeMap
unicodeMap  Windows-1255    /usr/share/xpdf/hebrew/Windows-1255.unicodeMap
#----- end Hebrew support package

>> ls /usr/share/xpdf/hebrew/
ISO-8859-8.unicodeMap  Windows-1255.unicodeMap

Best Answer

Luckily, the friendly Ubuntu people made it easy to install languages. Simply enter this command into your shell:

sudo apt-get install language-support-he language-pack-he

You will notice that it adds Hebrew support to quite a few other subsystems (such as HSpell, MySpell and PostgreSQL) and installs some Hebrew fonts as well.

For good measure, install the following Hebrew fonts:

sudo apt-get install culmus culmus-fancy xfonts-efont-unicode xfonts-efont-unicode-ib xfonts-intl-european msttcorefonts

Finally, when you run pdftotext, specify UTF-8 as the output encoding explicitly, since the default output encoding may be one (such as Latin1) that cannot represent Hebrew characters:

pdftotext -enc UTF-8 input.pdf output.txt
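
To confirm that the extraction actually produced Hebrew text, you can scan the output file for bytes in the Hebrew Unicode block. The following is a minimal sketch, not part of Xpdf: has_hebrew is a hypothetical helper name, and it assumes GNU grep built with -P (PCRE) support, which is what Ubuntu normally ships.

```shell
#!/bin/sh
# has_hebrew FILE
# Hypothetical helper (not part of Xpdf): succeeds (exit status 0) if FILE
# contains at least one UTF-8 encoded character from the Hebrew block
# U+0590..U+05FF, and fails (non-zero status) otherwise.
has_hebrew() {
    # In UTF-8, U+0590..U+05BF is encoded as the bytes D6 90 .. D6 BF,
    # and U+05C0..U+05FF as D7 80 .. D7 BF.
    # LC_ALL=C makes grep match raw bytes, -a treats the file as text,
    # -q suppresses output, and -P (GNU grep) enables the \xHH escapes.
    LC_ALL=C grep -aqP '\xd6[\x90-\xbf]|\xd7[\x80-\xbf]' -- "$1"
}
```

After running the pdftotext command above, something like `has_hebrew output.txt && echo "Hebrew text found" || echo "no Hebrew text in output"` shows quickly whether the second machine still drops the characters.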