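I'm currently writing a program that uses the Python NLTK library to determine whether a review is positive or negative. When I try to tokenize each word and store it in an array, I keep getting the above error. The lines of code around the line that raises the error are: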
from nltk.tokenize import word_tokenize
...
short_pos = open("reviews/pos_reviews.txt", "r").read()
short_neg = open("reviews/neg_reviews.txt", "r").read()
documents = []
for r in short_pos.split('\n'):
    documents.append( (r, "pos") )
for r in short_neg.split('\n'):
    documents.append( (r, "neg") )
all_words = []
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)
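The second-to-last line is where the error is reported. If I comment that line out, the error appears on the next line instead. I'm not sure where this error comes from, because I don't think I'm using unicode at all. Any help would be greatly appreciated!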
Answer 0 (score: 0)
In Python 2.7, try using the io module to specify the file encoding explicitly; see "Difference between io.open vs open in python".
Also, context managers are your friend (i.e. with ... as ...), especially for I/O: https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
import io
from nltk.tokenize import word_tokenize
documents = []
with io.open("reviews/pos_reviews.txt", "r", encoding="utf8") as fin:
    for line in fin:
        documents.append((line.strip(), "pos"))
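The same pattern can be wrapped in a small function so that both review files go through one code path. The following is only a sketch under a few assumptions: the helper name load_labeled_lines is hypothetical, the files are assumed to be UTF-8 encoded, blank lines are skipped (the snippet above keeps them), and a temporary file stands in for reviews/pos_reviews.txt so the example runs on its own.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_labeled_lines(path, label, encoding="utf8"):
    """Hypothetical helper: return (text, label) pairs, one per non-blank line.

    io.open decodes the file with the given encoding, so every `text`
    is a unicode string rather than raw bytes.
    """
    documents = []
    with io.open(path, "r", encoding=encoding) as fin:
        for line in fin:
            text = line.strip()
            if text:  # assumption: blank lines carry no review and are skipped
                documents.append((text, label))
    return documents


# Throwaway file standing in for reviews/pos_reviews.txt (assumed UTF-8).
tmp_dir = tempfile.mkdtemp()
pos_path = os.path.join(tmp_dir, "pos_reviews.txt")
with io.open(pos_path, "w", encoding="utf8") as fout:
    fout.write(u"A caf\u00e9 worth visiting\n\nGreat film\n")

documents = load_labeled_lines(pos_path, "pos")
print(documents)
```

Because the text is already decoded when it leaves the helper, each review string can then be handed to word_tokenize directly. If the real files are not UTF-8, the encoding argument would need to match whatever they actually use.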