UnicodeDecodeError when reading a custom-built corpus in NLTK

Asked: 2016-09-16 07:11:03

Tags: python character-encoding nltk

I have built a custom corpus with the nltk module for detecting sentence polarity. This is the corpus hierarchy:

polarity
--polar
---- polar_tweets.txt
--nonpolar
---- nonpolar_tweets.txt

Here is how I import the corpus in my source code:

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8')
corpus = polarity
print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt']))

But it raises the following error:

Traceback (most recent call last):
  File "E:/Analytics Practice/Social Media Analytics/analyticsPlatform/DataAnalysis/SentimentAnalysis/data/training_testing_data.py", line 9, in <module>
    print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt']))
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\util.py", line 765, in __repr__
    for elt in self:
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1135, in readline
    new_chars = self._read(readsize)
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Users\prabhjot.rai\AppData\Local\Continuum\Anaconda3\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 269: invalid continuation byte
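A minimal sketch of why byte 0xc2 in particular can trigger this message: in UTF-8, 0xc2 opens a two-byte sequence, so the very next byte must be a continuation byte (0x80–0xbf). A stray 0xc2 followed by a plain ASCII byte is invalid UTF-8. The byte string below is hypothetical, not taken from the corpus:

```python
# 0xc2 announces a two-byte UTF-8 sequence, but the space (0x20) that
# follows is not a valid continuation byte, so strict decoding fails.
data = b"tweet text \xc2 more text"

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte
```

This is exactly the failure mode of text that was written in a one-byte encoding (such as latin-1 or cp1252) and later read back as UTF-8.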

While creating the files polar_tweets.txt and nonpolar_tweets.txt, I decoded the file uncleaned_polar_tweets.txt as utf-8 and then wrote it out to polar_tweets.txt. Here is the code:

with open(path_to_file, "rb") as file:
    output_corpus = clean_text(file.read().decode('utf-8'))['cleaned_corpus']

output_file = open(output_path, "w")
output_file.write(output_corpus)
output_file.close()

where output_file is polar_tweets.txt or nonpolar_tweets.txt. Where is the error? I encode in utf-8 initially, and then read in utf-8 as well, in the line

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8')

If I replace encoding='utf-8' with encoding='latin-1', the code runs perfectly. Where is the problem? Do I need to decode in utf-8 while creating the corpus as well?
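A side note on why switching to encoding='latin-1' appears to "fix" the code: latin-1 maps every possible byte value 0–255 to a code point, so decoding with it can never raise an error, no matter what encoding the file was actually written in. It silently produces wrong characters (mojibake) instead of failing. A small sketch:

```python
# Every byte decodes under latin-1, so this never raises...
raw = bytes(range(256))
text = raw.decode("latin-1")
print(len(text))  # 256

# ...but the same bytes are not valid UTF-8, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

So latin-1 masks the symptom rather than identifying the file's real encoding.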

1 Answer:

Answer 0 (score: 1):

You need to understand that in Python's model, unicode is a kind of data but utf-8 is an encoding. They are not the same thing at all. You are reading your raw text, which is apparently in utf-8; cleaning it; and then writing it out to the new corpus without specifying an encoding. So you are writing it out in... who knows what encoding. Don't bother finding out — just clean and generate the corpus again, this time specifying the utf-8 encoding when you write.

I hope you are doing all of this in Python 3; if not, stop right here and switch to Python 3. Then write out the corpus like this:

output_file = open(output_path, "w", encoding="utf-8")
output_file.write(output_corpus)
output_file.close()
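To confirm the fix end to end, you can round-trip a small sample through a temporary file with the encoding pinned on both the write and the read. The sample string and path below are hypothetical stand-ins for the corpus files:

```python
import os
import tempfile

# Hypothetical stand-in for one cleaned tweet containing a non-ASCII character.
sample = "polar tweet with a non-ASCII char: café"
path = os.path.join(tempfile.mkdtemp(), "polar_tweets.txt")

# Write with an explicit encoding, exactly as the answer recommends.
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading back with the same explicit encoding recovers the text unchanged.
with open(path, encoding="utf-8") as f:
    assert f.read() == sample
```

Once the files themselves are genuinely utf-8, the encoding='utf-8' argument to LazyCorpusLoader should decode them without error.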