Python – What Does the Codecs Module Do?

character encoding, python

I just read through the documentation on the Codecs module, but I guess my knowledge/experience of comp sci doesn't run deep enough yet for me to comprehend it.

It's for dealing with encoding/decoding, especially Unicode, and while to many of you that's a complete and perfect explanation, to me that's really vague. I don't really understand what it means at all. Text is text is text in a .txt file? Or so this amateur thought.

Can anyone explain?

Best Answer

Once you understand why Unicode is necessary (recommended reading: link, link; thanks @DanielB for the excellent links) and that character encodings are how computers represent real-world characters as bytes, the rest follows: when Python reads or writes bytes representing text from a stream (which can be a file, a pipe, a socket, ...), it needs to know which character encoding is being used so that those bytes are meaningful as human-readable text.

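Python 2.x uses ascii as the default encoding for implicit conversions between text and bytes; in Python 3.x, open falls back on a platform-dependent default instead, which you should not rely on. The ascii encoding is inadequate for almost all languages (including English: there's no £ in ascii!), so you need to specify another encoding when writing to streams if you intend to use any characters outside ascii.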
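As a minimal sketch of the £ point, assuming Python 3 (the variable names are only illustrative), the ascii codec refuses the character while other codecs each give it their own byte sequence:

```python
# The pound sign is not part of ASCII, so the ascii codec cannot encode it.
pound = "\u00a3"  # the character £

try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Other codecs can represent it, each with a different byte sequence.
as_utf8 = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
as_latin1 = pound.encode("latin-1")  # one byte: 0xA3

print(ascii_ok, as_utf8, as_latin1)
```

The same character ends up as different bytes depending on the codec, which is why the codec has to be known on both the writing and the reading side.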

When reading from streams, you need to know which encoding was used to write to the stream and use the same encoding to read from it; otherwise the decoded result will be wrong. Try writing a string with Cyrillic characters with the ISO-8859-5 codec and reading it back with the UTF-8 codec: you'll get a decoding error or garbled text instead of the original, because the same byte sequences mean different things in the two encodings.
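The Cyrillic experiment can be sketched in memory, without a file, assuming Python 3. The sample word and variable names are just for illustration. Note that a strict UTF-8 decode of these particular bytes raises an error rather than silently returning the wrong text, so the sketch also decodes with errors="replace" to show the mismatch:

```python
text = "Привет"  # sample Cyrillic string

# Encode with ISO-8859-5: one byte per character.
raw = text.encode("iso-8859-5")

# Decoding with the same codec recovers the original text.
same = raw.decode("iso-8859-5")

# Decoding with UTF-8 in strict mode fails, because these bytes
# do not form valid UTF-8 sequences.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace" the decode succeeds but the text is garbage.
garbled = raw.decode("utf-8", errors="replace")

print(same == text, strict_failed, garbled == text)
```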

So to answer your specific question,

out_file = open('example.txt', 'w')

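opens the file for writing without naming a codec, so Python picks one implicitly: ascii when Python 2.x has to encode a unicode string, or the platform's default encoding in Python 3.x. If you want to specify the codec yourself, you need to either use the encoding parameter of the open function in Python 3.x: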

out_file = open('example.txt', 'a', encoding='utf-8')

or if you are still using Python 2.x (the latest is Python 2.7.3 at the time of this writing), you need to use the functions from the codecs module:

import codecs
out_file = codecs.open('example.txt', 'a', 'utf-8')

since open in Python 2.x doesn't allow you to specify an encoding (you can use it to read the byte stream into a byte string in memory though, and then decode that string).
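The three approaches mentioned above can be put side by side in a rough sketch. It assumes Python 3 (where codecs.open is still available, although the built-in open is the usual choice there), and the scratch path and sample text are hypothetical:

```python
import codecs
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
text = u"\u00a310 \u041f\u0440\u0438\u0432\u0435\u0442"  # "£10 Привет"

# Built-in open with an explicit encoding (Python 3 style).
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# codecs.open takes the encoding as its third argument and returns text.
with codecs.open(path, "r", "utf-8") as in_file:
    via_codecs = in_file.read()

# Plain binary read, followed by an explicit decode of the byte string.
with open(path, "rb") as raw_file:
    via_bytes = raw_file.read().decode("utf-8")

print(via_codecs == text, via_bytes == text)
```

All three agree only because the same codec, utf-8, is named at every step; swapping one of them for a different codec reproduces the mismatch described earlier.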
