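I'm writing code to stem tweets, but I'm running into an encoding problem. When I try to apply the Porter stemmer, it raises an error. Maybe I'm not tokenizing the text correctly.
My code is as follows: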
import sys
import pandas as pd
import nltk
import scipy as sp
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
reload(sys)
sys.setdefaultencoding('utf8')
stemmer=nltk.stem.PorterStemmer()
p_test = pd.read_csv('TestSA.csv')
train = pd.read_csv('TrainSA.csv')
def word_feats(words):
    return dict([(word, True) for word in words])
for i in range(len(train)-1):
    t = []
    #train.SentimentText[i] = " ".join(t)
    for word in nltk.word_tokenize(train.SentimentText[i]):
        t.append(stemmer.stem(word))
    train.SentimentText[i] = ' '.join(t)
When I try to run it, I get this error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-10-5aa856d0307f> in <module>()
23 #train.SentimentText[i] = " ".join(t)
24 for word in nltk.word_tokenize(train.SentimentText[i]):
---> 25 t.append(stemmer.stem(word))
26 train.SentimentText[i] = ' '.join(t)
27
/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
631 def stem(self, word):
632 stem = self.stem_word(word.lower(), 0, len(word) - 1)
--> 633 return self._adjust_case(word, stem)
634
635 ## --NLTK--
/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in _adjust_case(self, word, stem)
602 for x in range(len(stem)):
603 if lower[x] == stem[x]:
--> 604 ret += word[x]
605 else:
606 ret += stem[x]
/usr/lib64/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
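Does anyone have a clue what is wrong with my code? I'm stuck on this error. Any suggestions?
Answer 0 (score: 3)
I think the key line is 604, one frame above where the error is raised: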
--> 604 ret += word[x]
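Probably ret is a Unicode string and word is a byte string, so each ret += word[x] makes Python 2 implicitly decode a single byte. You can't decode UTF-8 byte by byte, which is what that loop is attempting.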
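To illustrate why a single byte cannot be decoded, here is a minimal sketch (written for Python 3, and using "é" as an assumed example character, since the actual tweet text is not shown). The byte 0xc3 from the traceback is the first half of a two-byte UTF-8 sequence:

```python
# "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
raw = "\u00e9".encode("utf-8")

# Decoding the complete sequence works.
whole = raw.decode("utf-8")

# Decoding only the first byte fails: 0xC3 announces a two-byte
# sequence, but the second byte is missing.
try:
    raw[:1].decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message reports the same "can't decode byte 0xc3 in position 0: unexpected end of data" condition as the traceback in the question.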
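The problem is that read_csv is returning bytes, and you are trying to do text processing on those bytes. That simply doesn't work; the bytes have to be decoded to Unicode first. I think you can use: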
pandas.read_csv(filename, encoding='utf-8')
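As a rough sketch of how the whole pipeline might look once the text is decoded on load: the CSV content below is a made-up stand-in for TrainSA.csv, and str.split() is used as a simplified tokenizer in place of nltk.word_tokenize so that the example needs no extra NLTK data. Both are assumptions for illustration only.

```python
import io

import pandas as pd
from nltk.stem import PorterStemmer

# Hypothetical stand-in for TrainSA.csv: UTF-8 bytes with a non-ASCII character.
csv_bytes = "SentimentText\nI was running to the caf\u00e9\n".encode("utf-8")

# encoding='utf-8' tells pandas how to decode the file's bytes into text.
train = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8")

stemmer = PorterStemmer()


def stem_text(text):
    # Simplified whitespace tokenizer; the question uses nltk.word_tokenize.
    return " ".join(stemmer.stem(word) for word in text.split())


# Stem every row of the column, including the last one.
train["SentimentText"] = train["SentimentText"].apply(stem_text)
print(train["SentimentText"][0])
```

Using apply on the column also avoids the element-by-element assignment in the original loop, though that is a separate stylistic choice from the encoding fix.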
If possible, use Python 3. There, trying to concatenate bytes and unicode always raises an error, which makes these problems much easier to spot.
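A small sketch of that Python 3 behavior, with made-up values: mixing str and bytes fails immediately with a TypeError instead of triggering an implicit decode, and an explicit decode makes the concatenation well defined.

```python
text = "caf"                      # str (Unicode text)
raw = "\u00e9".encode("utf-8")    # bytes

# Python 3 refuses to mix the two types rather than guessing an encoding.
try:
    combined = text + raw
    error_type = None
except TypeError as exc:
    combined = None
    error_type = type(exc).__name__

# Decoding explicitly first gives a plain str result.
fixed = text + raw.decode("utf-8")

print(error_type, fixed)
```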