UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 257: invalid start byte

Asked: 2019-04-24 05:17:41

Tags: python-3.x

I am new to Python and want to apply some preprocessing steps to a text file,
but reading the file fails with a decoding error. Here is my code:

import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem import PorterStemmer

ps = PorterStemmer()

print("\n Reading file without stopwords.")
# This is the line that raises the UnicodeDecodeError
with open('preprocessing.txt', encoding='utf-8') as f:
    text_file = f.read()
stop_words = set(stopwords.words("english"))
words = word_tokenize(text_file)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

print("\n Removed stopwords.")
print(stop_words)

print("\n Stemming.")
# Loop over the tokens; looping over text_file would yield single characters
for w in words:
    print(ps.stem(w))
    print(w)
print(sent_tokenize(text_file))

print("\n Tokenization.")
print(word_tokenize(text_file))

print("\n Part-of-speech tagging.")
print(pos_tag(words))

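I want to display the results in a specific format, but the output is:

", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 257: invalid start byte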
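To see what the error message is pointing at, it can help to look at the raw bytes around the reported position. The sketch below is only an illustration: `bytes_around` is a hypothetical helper, and the sample bytes stand in for the real `preprocessing.txt`, whose contents are not shown in the question.

```python
def bytes_around(raw: bytes, position: int, context: int = 10) -> bytes:
    """Return the bytes surrounding `position` so the offending byte is visible."""
    start = max(0, position - context)
    return raw[start:position + context + 1]


# Stand-in data (an assumption): "it’s a test" saved as Windows-1252,
# which stores the curly apostrophe as the single byte 0x92.
raw = "it\u2019s a test".encode("cp1252")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    bad_position = err.start  # index of the first undecodable byte
    print(err.reason, "at position", bad_position)
    print(bytes_around(raw, bad_position, context=3))
```

For the real file, `raw` would instead come from `open('preprocessing.txt', 'rb').read()`, and the position to inspect would be 257.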

2 Answers:

Answer 0: (score: 0)

Try reading the data with encoding='unicode_escape'. For example:

text_file = open('preprocessing.txt', encoding='unicode_escape').read()

This resolves the UnicodeDecodeError. It worked for me.

Otherwise, you can try the following:

text_file = open(r'preprocessing.txt', encoding='unicode_escape').read()
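One thing worth checking before settling on an encoding is what each codec actually produces for byte 0x92. The comparison below is a sketch on made-up sample bytes. It assumes the file was saved as Windows-1252 (cp1252), where 0x92 is the curly apostrophe; the question does not confirm that. Note that unicode_escape avoids the exception but maps 0x92 to a control character and also interprets backslash sequences in the text.

```python
# Sample bytes (an assumption): "don’t" as it would be stored in Windows-1252.
raw = b"don\x92t"

as_cp1252 = raw.decode("cp1252")                     # 0x92 -> U+2019 (curly apostrophe)
as_escape = raw.decode("unicode_escape")             # 0x92 -> U+0092 (control character)
as_replaced = raw.decode("utf-8", errors="replace")  # 0x92 -> U+FFFD (replacement char)

print(repr(as_cp1252))
print(repr(as_escape))
print(repr(as_replaced))

# unicode_escape also rewrites backslash sequences found in the data:
path_like = rb"C:\temp".decode("unicode_escape")     # "\t" becomes a tab
print(repr(path_like))
```

If the cp1252 assumption holds for the real file, `open('preprocessing.txt', encoding='cp1252')` would keep the apostrophes intact, which matters for later tokenization.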

Answer 1: (score: 0)

Make sure your file is actually encoded in UTF-8. If it is not, open it in Notepad++, go to the Encoding menu, convert it to UTF-8, and save it.
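The same conversion can be done from Python instead of an editor. This is a minimal sketch: `convert_to_utf8` is a hypothetical helper, the source encoding is assumed to be cp1252 (it has to match whatever the file was really saved in), and the demo writes to a temporary directory rather than touching the real file.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Read `src` in its original encoding and write the same text to `dst` as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "preprocessing.txt"
    dst = Path(tmp) / "preprocessing_utf8.txt"
    src.write_bytes(b"It\x92s a test.")  # stand-in for a cp1252-encoded file

    convert_to_utf8(src, dst)
    # The converted copy now opens cleanly with encoding='utf-8'.
    converted = dst.read_text(encoding="utf-8")
    print(converted)
```

Writing to a separate output file keeps the original available in case the assumed source encoding turns out to be wrong.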