Question

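I'm new to Python, and I'm trying to work through a large chunk of the Yelp! dataset, which comes as JSON but which I converted to CSV, using the pandas library and NLTK.

As part of data preprocessing, I first try to remove all punctuation as well as the most common stop words. After that, I want to apply the Porter stemming algorithm, which is readily available in nltk.stem.

Here is my code: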

import re
from nltk.corpus import stopwords

def stopWords(review):
    """Remove noise from the data and the most common English stop words (NLTK)."""
    stopset = set(stopwords.words("english"))
    review = review.lower()
    # Strip or replace individual punctuation marks and the contraction "i'm"
    review = review.replace(".", "")
    review = review.replace("-", " ")
    review = review.replace(")", "")
    review = review.replace("(", "")
    review = review.replace("i'm", " ")
    review = review.replace("!", "")
    review = re.sub("[$!@#*;:<+>~-]", '', review)
    row = review.split()

    # Re-join the remaining words into a single space-separated string
    tokens = ' '.join([word for word in row if word not in stopset])
    return tokens

Here I feed the tokens into the stemming method I wrote:

from nltk import stem

def stemWords(impWords):
    """Stem the words to their roots using the Porter algorithm (NLTK)."""
    stemmer = stem.PorterStemmer()
    tok = stopWords(impWords)
    # ======================================================================
    stemmed = " ".join([stemmer.stem(str(word)) for word in tok.split(" ")])
    # ======================================================================
    return stemmed

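But I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data. The line between the '==' markers is the one that raises the error.

I have tried cleaning the data and removing all the special characters !@#$^&* and others to make this work. The stop-word removal works fine, but the stemming does not. Can someone tell me what I'm doing wrong?

If my data is not clean, or a Unicode string is breaking somewhere, is there any way I can clean or fix it so that it doesn't give me this error? I want to do the stemming, so any suggestions would be helpful.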

Answer 1

Read up on Unicode string handling in Python. There is the type str, but there is also a type unicode.
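To make the distinction concrete, here is a small sketch of how text and its encoded bytes differ, and how a multi-byte UTF-8 character cut in half produces the message from the question. The sample string is made up for illustration and is not taken from the Yelp data. The snippet is written to run on Python 3, where the two types are called str and bytes; on Python 2 the corresponding types are unicode and str.

```python
# Text (a sequence of characters): a pound sign followed by the digit 5
text = u"\u00a35"

# Encoded bytes: the pound sign takes two bytes in UTF-8 (0xc2 0xa3)
raw = text.encode("utf-8")
text_length = len(text)   # counts characters
raw_length = len(raw)     # counts bytes

# Keeping only the first byte leaves a dangling 0xc2 lead byte
broken = raw[:1]
try:
    broken.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(text_length, raw_length)
print(message)
```

Anything that splits or slices the encoded bytes rather than the decoded text can leave such a half character behind, which is one plausible way for the stemming line to fail even though the earlier cleaning step ran.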

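I suggest:

Decode each line immediately after reading it, to narrow down the bad characters in your input data (real-world data contains errors).
Use unicode and u" " strings everywhere.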
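The first suggestion could look roughly like the sketch below. The helper name decode_lines and the sample byte strings are hypothetical, and UTF-8 is assumed as the input encoding; the point is only that decoding line by line tells you which records are broken instead of failing somewhere deep inside the stemmer.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Decode byte lines one at a time, separating clean text from failures."""
    decoded, failures = [], []
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            # Keep the line number and raw bytes so the bad record can be inspected
            failures.append((lineno, raw, str(exc)))
    return decoded, failures


# Made-up sample: the second line ends in a dangling 0xc2 lead byte
sample = [b"great caf\xc3\xa9", b"truncated \xc2", b"plain ascii"]
decoded, failures = decode_lines(sample)

for lineno, raw, reason in failures:
    print(lineno, repr(raw), reason)
```

If inspecting or dropping the bad lines is not an option, raw.decode(encoding, errors="replace") substitutes U+FFFD for the undecodable bytes instead of raising. Once everything is decoded, the rest of the pipeline works on text only, which is what the second suggestion amounts to.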
