Question

I am new to Python and want to apply some preprocessing steps to a text file, but I get a decoding error. Here is my code:

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print("\n Reading file without stopwords.")
text_file = open('preprocessing.txt', encoding='utf-8').read()
stop_words = set(stopwords.words("english"))
words = word_tokenize(text_file)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)
print("\n Removed stopword.")
print(stop_words)
print("\n Stemming.")
# Stem each token; looping over text_file itself would yield single characters.
for w in words:
    print(ps.stem(w))
    print(w)
print(sent_tokenize(text_file))
print("\n tokenization.")
print(word_tokenize(text_file))
print("\n part of speech tagging.")
print(pos_tag(words))
```

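I want to display the results in a specific format, but the output is ", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 257: invalid start byte"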

Answer 1

Try reading the data with encoding='unicode_escape'. For example:

text_file = open('preprocessing.txt', encoding='unicode_escape').read()

This resolves the UnicodeDecodeError. It worked for me.

Otherwise, you can try the following:

text_file = open(r'preprocessing.txt', encoding='unicode_escape').read()
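
A caveat worth checking before relying on this: byte 0x92 is the curly apostrophe (’) in Windows-1252, so the file may have been saved by a Windows editor rather than as UTF-8. That is an assumption about this particular file, not something the traceback proves. The sketch below compares what each codec does with that byte; `sample` is made-up data, not the contents of preprocessing.txt.

```python
sample = b"It\x92s a test"  # hypothetical bytes; 0x92 as in the traceback

# utf-8 rejects 0x92 as a start byte, which is the error from the question.
try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# unicode_escape does not fail, but it maps 0x92 to the control character
# U+0092 and would also interpret any backslash sequences in the text.
escaped = sample.decode("unicode_escape")

# cp1252 (Windows-1252) maps 0x92 to U+2019, the right single quotation mark.
windows = sample.decode("cp1252")

print(utf8_ok, ascii(escaped), ascii(windows))
```

If cp1252 gives readable text for your file, the corresponding call would be open('preprocessing.txt', encoding='cp1252').read().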

Answer 2

Make sure your file is encoded in UTF-8. If it is not, open it in Notepad++, go to the Encoding tab, convert it to UTF-8, and save it.
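
If Notepad++ is not available, the same conversion can be sketched in Python. This assumes the file is really Windows-1252 (cp1252), which the 0x92 byte suggests but does not prove. `convert_to_utf8` and the output file name are hypothetical names used only for illustration, and the demo writes to a temporary directory instead of touching your real file.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8; source_encoding is an assumed guess."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Demo on a throwaway file containing the problematic 0x92 byte.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "preprocessing.txt"
    dst = Path(tmp) / "preprocessing_utf8.txt"
    src.write_bytes(b"It\x92s a test")
    convert_to_utf8(src, dst)
    # The converted copy now opens with encoding='utf-8' without an error.
    converted = dst.read_text(encoding="utf-8")

print(ascii(converted))
```

Writing to a new file rather than overwriting the original keeps the source intact in case the guessed encoding turns out to be wrong.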

