Question

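There are many similar questions, and I have tried every solution suggested in them, but I still cannot fix the problem. Here is my code; I am using the Stanford Tagger for named entity recognition.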

from nltk.tag import StanfordNERTagger
# Raw strings keep the Windows backslashes from being read as escape sequences.
st = StanfordNERTagger(r'stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz',
                   r'stanford-ner\stanford-ner.jar', encoding='utf-8')
tuple_list = st.tag("Please pay €94 million.".split())
print(tuple_list)

This is the error I get.

Traceback (most recent call last):
File "C:/Users/Dell/PycharmProjects/CSSOP/ner2.py", line 4, in <module>
tuple_list = st.tag("He was the subject of the most expensive association football transfer when he moved from Manchester United to Real Madrid in 2009 in a transfer worth €94 million ($132 million).".split())
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 71, in tag
return sum(self.tag_sents([tokens]), []) 
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 95, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 247: invalid start byte

Edit: this is not a file-opening encoding problem like the one described in other similar questions.

Answer 1

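You get the decoding error when nltk's Stanford wrapper tries to read back the output of the Stanford recognizer (which is a Java program). Evidently the recognizer has managed to create an invalid UTF-8 file. It apparently does not check the data you pass it before writing it out, so the problem is only detected when Python tries to read the output back in.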
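The failure mode can be reproduced without Java at all. The sketch below is only a simulation: `simulated_output` is a made-up byte string standing in for the recognizer's output, not real Stanford NER output, and it assumes the euro sign was written as a Windows-1252 byte.

```python
# Simulated tagger output (an assumption for illustration, not real
# Stanford NER output): the euro sign is encoded as Windows-1252,
# which produces the single byte 0x80.
simulated_output = "Please/O pay/O €94/O million./O".encode("cp1252")

try:
    # This mirrors the decode step that fails in the traceback.
    simulated_output.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed here has the same form as the one in the traceback; only the byte position differs, because the input text is different.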

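Now, the very top of the Windows-1252 code page table shows that 0x80 is how that code page encodes the euro sign. The implication is clear: your Python source file is saved in the Windows-1252 encoding, and therefore so are the contents of your string literals. The correct solution here is to switch your editor to UTF-8 and fix the encoding of your program.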
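The byte values are easy to check from Python. The snippet below is a small sketch; `first_invalid_utf8_offset` is a hypothetical helper introduced here for illustration, which could also be applied to the raw bytes of a source file to see whether it was saved as UTF-8.

```python
euro = "\u20ac"  # the euro sign

# One byte in Windows-1252, three bytes in UTF-8.
cp1252_bytes = euro.encode("cp1252")
utf8_bytes = euro.encode("utf-8")
print(cp1252_bytes, utf8_bytes)


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The Windows-1252 euro byte is rejected; the UTF-8 encoding is accepted.
bad_offset = first_invalid_utf8_offset(b"pay " + cp1252_bytes)
good_offset = first_invalid_utf8_offset(b"pay " + utf8_bytes)
print(bad_offset, good_offset)
```

To check a file, one could pass it the result of `open(path, "rb").read()`, where `path` is whatever script is suspected of being saved in the wrong encoding.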

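This behavior would make sense if you were using Python 2, but your snippet appears to be Python 3 (print is used as a function), so please clarify before I venture alternative fixes.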
