Understanding the encoding scheme in Python 3

python-3.x unicode

I got this error in my program, which grabs data from different websites and writes it to a file:

'charmap' codec can't encode characters in position 151618-151624: character maps to <undefined>

I am not familiar with encoding and decoding, and I was fine with how Python 2 handled it. Although the Python developers officially say they made the change to make things better, it seems to have made things worse.

I have no idea how to fix these errors. However, I am a proactive person, so I would really like to know what is causing the problem and how to solve it. I have checked the official site, but the wording is hard to understand.

Could I get a simple explanation? A link to another page is also acceptable.

EDIT:
I've checked this page, the Unicode HOWTO for Python 2.7. My understanding is that we must translate the Unicode string into a binary format when writing it to a file, and that this requires an encoding. Obviously 'utf-8' is the best one, so why doesn't the Python interpreter just use 'utf-8'? Instead, it uses some strange codec such as 'charmap', 'cp950', or 'latin-1'.

Best Answer

See the Python wiki on the subject. You are trying to encode a Unicode string (the default string type in Python 3) with an encoding that doesn't support some of the characters in your string. For example:

>>> '\u0411'.encode("iso-8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u0411' in position 0: character maps to <undefined>
>>> '\u0411\u0411\u0411\u0411\u0411'.encode("iso-8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>

As you can see, some encodings use a character map internally, which is why the message names 'charmap' rather than the actual encoding. They also report a run of consecutive unencodable characters as a single error.

You'll have to narrow it down to the exact code points (the characters in positions 151618-151624) and the encoding used.
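
One way to do that narrowing is sketched below. It is only an illustration: `find_unencodable`, `sample`, and `path` are hypothetical names, and `"iso-8859-15"` is borrowed from the example above rather than taken from your program, so substitute whatever codec your platform reports. The last part shows the usual way around the error, assuming UTF-8 output is acceptable for your file: pass `encoding="utf-8"` to `open()` instead of relying on the platform default.

```python
import locale
import os
import tempfile


def find_unencodable(text, encoding):
    """Return (position, character, code point) for every character
    in `text` that `encoding` cannot represent."""
    problems = []
    for position, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((position, char, "U+%04X" % ord(char)))
    return problems


# The encoding open() falls back to when none is given.
# It depends on the platform and locale settings.
print(locale.getpreferredencoding(False))

# Stand-in for the scraped text (hypothetical data).
sample = "abc\u0411\u0411def"
print(find_unencodable(sample, "iso-8859-15"))
# [(3, 'Б', 'U+0411'), (4, 'Б', 'U+0411')]

# Writing with an explicit encoding avoids the platform default entirely.
# UTF-8 can represent every character a Python 3 str normally holds.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)
```

If the file has to stay in a legacy encoding, `open()` also accepts an `errors` argument such as `errors="replace"`, at the cost of losing the characters that cannot be represented.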
