Question

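I am writing code to stem tweets, but I am running into an encoding problem. When I try to apply the Porter stemmer, it raises an error. Perhaps I am not tokenizing the text correctly.

My code is as follows: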

import sys
import pandas as pd
import nltk
import scipy as sp
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
reload(sys)  
sys.setdefaultencoding('utf8')


stemmer=nltk.stem.PorterStemmer()

p_test = pd.read_csv('TestSA.csv')
train = pd.read_csv('TrainSA.csv')

def word_feats(words):
    return dict([(word, True) for word in words])

for i in range(len(train)-1):
    t = []
    #train.SentimentText[i] = " ".join(t)
    for word in nltk.word_tokenize(train.SentimentText[i]):
        t.append(stemmer.stem(word))
    train.SentimentText[i] = ' '.join(t)

When I try to run it, I get this error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-5aa856d0307f> in <module>()
     23     #train.SentimentText[i] = " ".join(t)
     24     for word in nltk.word_tokenize(train.SentimentText[i]):
---> 25         t.append(stemmer.stem(word))
     26     train.SentimentText[i] = ' '.join(t)
     27 

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
    631     def stem(self, word):
    632         stem = self.stem_word(word.lower(), 0, len(word) - 1)
--> 633         return self._adjust_case(word, stem)
    634 
    635     ## --NLTK--

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in _adjust_case(self, word, stem)
    602         for x in range(len(stem)):
    603             if lower[x] == stem[x]:
--> 604                 ret += word[x]
    605             else:
    606                 ret += stem[x]

/usr/lib64/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Does anyone have a clue where my code goes wrong? I am stuck on this error. Any suggestions?

Answer 1

I think the key line is 604, one frame above where the error is raised:

--> 604                 ret += word[x]

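Probably ret is a Unicode string and word is a byte string. You cannot decode UTF-8 one byte at a time, which is what that loop ends up attempting: a non-ASCII character spans several bytes, so a single byte such as 0xc3 is an incomplete sequence.

The problem is that read_csv is returning bytes, and you are trying to do text processing on those bytes. That simply does not work; the bytes must first be decoded to Unicode. I think you can use: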
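
To illustrate why indexing a byte string one element at a time breaks, here is a minimal sketch in Python 3 syntax. The sample word `café` and the helper `decode_single_byte` are assumptions made up for illustration; they are not from the question's data or from NLTK. The character `é` is encoded as the two bytes `0xc3 0xa9`, so decoding the first of them alone fails in the same way as in the traceback:

```python
# Illustrative sample word (an assumption, not taken from the question's data).
word_bytes = "café".encode("utf-8")  # the bytes c, a, f, 0xc3, 0xa9

# Four characters but five bytes, because 'é' needs two bytes in UTF-8.
print(len("café"), len(word_bytes))


def decode_single_byte(data, index):
    """Hypothetical helper: decode one byte in isolation.

    This mimics what happens when a single byte is implicitly
    coerced to Unicode while being appended to a Unicode string.
    """
    try:
        return data[index:index + 1].decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


ascii_result = decode_single_byte(word_bytes, 0)      # a plain ASCII byte decodes fine
lead_byte_result = decode_single_byte(word_bytes, 3)  # lone 0xc3 is an incomplete sequence
whole_word = word_bytes.decode("utf-8")               # decoding the whole word works

print(ascii_result, lead_byte_result, whole_word)
```

Decoding the complete byte string succeeds, which is why the fix belongs where the data is read rather than inside the stemmer.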

pandas.read_csv(filename, encoding='utf-8')
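
The underlying idea is to decode once, at the point where the file is read, so that everything downstream works on text. A rough sketch of that idea using only the standard library follows; it is a stand-in for illustration, not what pandas does internally. The in-memory CSV content is made up, and `io.TextIOWrapper` plays the role that the `encoding` argument plays for `read_csv`:

```python
import csv
import io

# Made-up CSV bytes standing in for TrainSA.csv (an assumption for illustration).
raw = "SentimentText\ncafé is lovely\nnaïve tweet\n".encode("utf-8")

# Decode at the boundary: wrap the byte stream so that reads yield text.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = [row["SentimentText"] for row in csv.DictReader(text_stream)]

# Every value is now a text string, so per-character indexing is safe.
fourth_char = rows[0][3]
print(rows, fourth_char)
```

Once the column holds text rather than bytes, a character-by-character loop like the one in `_adjust_case` sees whole characters instead of fragments of a multi-byte sequence.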

If possible, use Python 3. There, trying to concatenate bytes and unicode always raises an error, which makes these problems much easier to spot.
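
A small sketch of that Python 3 behaviour, using a hypothetical helper `try_concat` and made-up sample values: mixing `str` and `bytes` fails immediately with a `TypeError`, regardless of the content, instead of failing only when a non-ASCII byte happens to appear.

```python
def try_concat(text, data):
    """Hypothetical helper: attempt to join a text string and a byte string."""
    try:
        return text + data
    except TypeError:
        return "TypeError"


# Python 3 refuses to mix the two types, even for pure ASCII bytes.
mixed_non_ascii = try_concat("caf", b"\xc3\xa9")
mixed_ascii = try_concat("caf", b"e")

# Decoding explicitly first makes the concatenation valid.
fixed = "caf" + b"\xc3\xa9".decode("utf-8")

print(mixed_non_ascii, mixed_ascii, fixed)
```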
