Question

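I'm trying to read a CSV file that contains 3 million tweets. Ultimately, I want to remove stop words and get the top 2000 unique words and their frequencies. Before I get to that point, however, I hit an error. Here is my code: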

import nltk
from nltk.corpus import stopwords
import csv

f = open("/Users/shannonmcgregor/Desktop/ShanTweets.csv")
shannon_sample_tweets = f.read()
f.close()

filtered_tweets = [w for w in shannon_sample_tweets if not w in stopwords.words('english')]

The error I get after running it is:

__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Can anyone help me figure out what is going wrong? I did put # -*- coding: utf-8 -*- at the top of the source file.

Answer 1

OK, your comment makes it clear. To get your CSV into unicode, you should run import codecs and then:

f = codecs.open("/Users/shannonmcgregor/Desktop/ShanTweets.csv","r","utf-8")

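Then, if you recheck the type of the data you read from the CSV, you should see unicode. This of course assumes that your tweets are valid utf-8, which seems to be the case (I had a quick look!). If you plan to work with strings in Python, I suggest you read up on encodings; they will become important for your work.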
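For the end goal in the question (drop stop words, then take the 2000 most frequent words), here is a minimal sketch rather than a definitive implementation. It assumes Python 3, where the built-in open accepts an encoding argument, and it assumes the tweet text sits in the first CSV column. The function name top_words and its parameters are hypothetical, chosen only for illustration. Note that it splits each tweet into words first: iterating directly over the string returned by f.read() yields single characters, not words.

```python
import csv
import re
from collections import Counter


def top_words(path, stop_words, n=2000, text_column=0, encoding="utf-8"):
    """Return the n most frequent non-stop words in one column of a CSV file.

    Assumptions (not from the original post): the tweet text is in
    `text_column`, and the file is valid text in `encoding`.
    """
    # A set gives fast membership tests; a list would be scanned on every lookup.
    stop_words = set(stop_words)
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, "r", encoding=encoding, newline="") as f:
        for row in csv.reader(f):
            if len(row) <= text_column:
                continue  # skip empty or short rows
            # Tokenize into lowercase words instead of iterating over characters.
            words = re.findall(r"\w+", row[text_column].lower())
            counts.update(w for w in words if w not in stop_words)
    # List of (word, frequency) pairs, most frequent first.
    return counts.most_common(n)
```

To use the NLTK list from the question, you could pass stopwords.words('english') as the stop_words argument; that requires the stopwords corpus to have been downloaded with nltk.download('stopwords'). Reading row by row also avoids loading all 3 million tweets into memory at once.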

Read a CSV file, remove stop words, find the unique words
