Python NLTK: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 50: invalid continuation byte

encoding, nltk, python, unicode, utf-8

Traceback (most recent call last):
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <module>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <listcomp>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/", line 11, in <listcomp>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:\Python34\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from
['PROJECT', 'FINAL', 'REPORT', 'Revision', 'History', 'Date', 'Version', 'Author', 'Validated', 'by', 'Purpose', '4', '-', 'Dec', '-', '13', '0', '.', '1', 'EA', 'Initial', 'Document', '1', '/', '8', '/', '2014', '0', '.', '2', 'EA', '&', 'AHE', 'Combined', 'the', 'copy', 'for', 'both', 'MOE', 'and', 'MOA', '.', '1', '/', '8', '/', '2014', '0', '.', '3']
    tokens = self.read_block(self._stream)
  File "C:\Python34\lib\site-packages\nltk\corpus\reader\plaintext.py", line 117, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1095, in readline
    new_chars = self._read(readsize)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1322, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1352, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python34\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 50: invalid continuation byte

I am trying to preprocess text using NLTK, but I keep running into this error. Any suggestions would be helpful.

Best Answer

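Some lines of code would be useful. However, my guess is that your corpus reader should be using an encoding other than UTF-8, probably latin-1. The byte 0xe9 is 'é' in latin-1, whereas in UTF-8 it can only start a multi-byte sequence, which is why the decoder complains about an invalid continuation byte.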

import nltk
# Decode every file under the corpus root as latin-1 instead of the default UTF-8.
corpus = nltk.corpus.reader.PlaintextCorpusReader(
    "/path/to/files", r'.*', encoding='latin-1')

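To see why latin-1 is a plausible guess, the following minimal sketch reproduces the same kind of failure using only the standard library. The sample bytes and the helper name `try_decode` are made up for illustration and do not come from the asker's files. Note that latin-1 maps every possible byte to a character, so a latin-1 decode never raises and does not by itself prove the guess is right; it is worth checking the decoded text by eye.

```python
# Hypothetical sample: text encoded as latin-1, where 'é' is the single byte 0xe9.
raw = "café report".encode("latin-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# In UTF-8, 0xe9 announces a three-byte sequence, but the next byte is a plain
# space, so strict decoding fails with "invalid continuation byte".
as_utf8 = try_decode(raw, "utf-8")       # None
as_latin1 = try_decode(raw, "latin-1")   # 'café report'
print(as_utf8, as_latin1)
```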
See also the related question "UnicodeDecodeError, invalid continuation byte".
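
If only some files in the corpus are affected, it can help to find out which ones before changing the encoding for everything. A rough sketch, assuming the corpus is a directory of plain-text files small enough to read into memory; the function name `find_non_utf8_files` is hypothetical and not part of NLTK:

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return the sorted names of files under root whose bytes are not valid UTF-8."""
    bad = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Strict decoding raises UnicodeDecodeError on the first invalid byte.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return sorted(bad)
```

Calling `find_non_utf8_files("/path/to/files")` would list the files worth inspecting. NLTK's corpus readers are also documented as accepting a per-file mapping for `encoding`, which may be worth checking in the NLTK documentation if only a few files need latin-1.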