Question

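I want to count the words that occur in several CSV files. First, I want to show the 10 most frequent words, once with stop words included and then with stop words removed.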

This is my code:

import nltk
nltk.download("stopwords")


from nltk.corpus import stopwords


myfile = sc.textFile('./Sacramento*.csv')


counts = myfile.flatMap(lambda line: line.split(",")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)


sorted_counts = counts.map(lambda (a, b): (b, a)).sortByKey(0, 1).map(lambda (a, b): (b, a))


first_ten = sorted_counts.take(10)


first_ten
Out[7]:
[(u'Residential', 917),
 (u'2', 677),
 (u'CA', 597),
 (u'3', 545),
 (u'SACRAMENTO', 439),
 (u'ours', 388),
 (u'0', 387),
 (u'4', 277),
 (u'Mon May 19 00:00:00 EDT 2008', 268),
 (u'Fri May 16 00:00:00 EDT 2008', 264)]


cachedStopWords = stopwords.words("english")


result_ll = counts.map(lambda (a, b): (b, a)).sortByKey(0,
1).map(lambda (a, b): (b, a))


print [i for i in result_ll.take(10) if i not in cachedStopWords]

But the output still contains stop words. For example, 'ours' is in the stop word list:

[(u'Residential', 917), (u'2', 677), (u'CA', 597), (u'3', 545), (u'SACRAMENTO', 439), (u'ours', 388), (u'0', 387), (u'4', 277), (u'Mon May 19 00:00:00 EDT 2008', 268), (u'Fri May 16 00:00:00 EDT 2008', 264)]

How should I change my code so that the output no longer contains the stop word 'ours'?

Answer 1

You have an error in the last line. It should be

print [i for i in result_ll.take(10) if i[0] not in cachedStopWords]

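because i[0] holds the actual word. Each element i is a (word, count) tuple, and a tuple never equals any of the strings in cachedStopWords, so the original test excluded nothing.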
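To make the difference concrete, here is a minimal sketch in plain Python (no Spark needed) that reuses the ten rows from the question's output. The short stop_words list is an assumption standing in for stopwords.words("english"), and the snippet uses Python 3 syntax rather than the Python 2 syntax of the question.

```python
# Rows copied from the question's output: each element is a (word, count) tuple.
first_ten = [
    ("Residential", 917),
    ("2", 677),
    ("CA", 597),
    ("3", 545),
    ("SACRAMENTO", 439),
    ("ours", 388),
    ("0", 387),
    ("4", 277),
    ("Mon May 19 00:00:00 EDT 2008", 268),
    ("Fri May 16 00:00:00 EDT 2008", 264),
]

# Assumption: a tiny stand-in for stopwords.words("english"),
# which is a list of plain strings.
stop_words = ["ours", "the", "and"]

# Original test: compares the whole tuple against strings,
# so no element is ever excluded.
unfiltered = [i for i in first_ten if i not in stop_words]

# Corrected test: compares only the word, i[0].
filtered = [i for i in first_ten if i[0] not in stop_words]

print(len(unfiltered))  # 10
print(len(filtered))    # 9
```

Note that filtering after take(10) leaves only nine entries here. If ten stop-word-free entries are wanted, one option is to filter the RDD before sorting, for example with counts.filter(lambda kv: kv[0] not in cachedStopWords), and only then take the top ten.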

Computing word frequencies across multiple CSV files without stop words
