Python encoding problem

encodingpythonshell

I have a python script that exports data from a mysql utf-8 table into an textfile. Here is the code that does the job

csvDatei = codecs.open( csvDateiName, "w", "utf-8" )
...
cursor = db.cursor();
sql = "select * from %s.%s;" % (dbAusgang, tabelle)
cursor.execute(sql);
...
daten = cursor.fetchall();
for i in xrange(len(daten)):
    line = '';
    for j in xrange(len(daten[i])):
        line += '"%s";' % unicode(daten[i][j]);
    line = line[:-1];
    line += '\n';
    csvDatei.write(line);
csvDatei.close();

I have also tried this

line += '"%s";' % str(daten[i][j]);

and

line += '"%s";' % daten[i][j];

And now the part I do not understand:

Normally this script should be invoked by a cron job. But when I read a varchar from a table that contains an umlaut like ä,ö or ü the script simply terminates. I checked that by piping the output of the script into a file.

Therefore I tested the script by invoking it manually on the shell by simply typing "python myscript.py" and it runs perfectly fine without any problems.

So my guess is that the problem does not really lie in the script itself but rather in the cron enviroment somehow.

Hopefully anybody of you can give me advice. I'm completly confused.

Any help is appreciated.

—————- Answer to comment 1:

Thanks for the hint with the locale.

First I wrote "locale" on the standard shell. It gave me the following output:

dhl@srv1093:~$ locale
LANG=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=de_DE.UTF-8

Then I edited the cron file with "crontab -e" and added the folloing line

*/1 * * * * locale > /home/user/locale.ouput

The output of this cronjob is:

dhl@srv1093:~$ cat locale.ouput 
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, this might be the problem? How can I fix that?

Best Answer

You decode the lines to unicode with unicode(daten[i][j]). When you give no encoding then Python uses the system default, which is probably ascii when you run the script through cron.

In either way, you must give the actual encoding used by the database. You can use unicode(daten[i][j], dbencoding) instead, or get your database adapter to give you unicode directly.

Btw: There are probably a million tools that generate cvs files from database queries, MySQL has that even built-in. Your code on the other hand is rather fragile because you do no escaping whatsoever.