I want to count how often each word is used in some documents, using:
Counter(word.rstrip(punctuation) for word in words).most_common(10)
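I can't simply append .subtract(exclusion_list) to this expression, where exclusion_list is a list of words I don't want. How can I get the top ten words without counting the ones in the exclusion list?
Answer 0 (score: 0)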
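To get the top 10 words that are not in the exclusion list, this should work. The punctuation is stripped before the membership test, so that a token such as "the," is compared as "the":
Counter(w for w in (word.rstrip(punctuation) for word in words) if w not in exclusion_list).most_common(10)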
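Otherwise, if for some reason you want to take the top 10 words first and then drop the ones in the exclusion list (which can leave you with fewer than 10), this should work: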
[w for w in Counter(word.rstrip(punctuation) for word in words).most_common(10) if w[0] not in exclusion_list]
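The difference between the two approaches is easiest to see on a small input. The following is a minimal sketch, not part of the original answer: the sample text, the exclusion list, and the names `before` and `after` are made up for illustration, and a top 3 is used instead of a top 10 to keep the data short.

```python
from collections import Counter
from string import punctuation

# Hypothetical sample data, chosen so the two approaches disagree.
words = "a a a b b b c c d d e f".split()
exclusion_list = ["a", "b"]

# Filter first: excluded words never compete for a slot in the top 3.
before = Counter(
    w for w in (word.rstrip(punctuation) for word in words)
    if w not in exclusion_list
).most_common(3)

# Filter afterwards: excluded words take up slots, so fewer results survive.
after = [
    (w, c)
    for w, c in Counter(word.rstrip(punctuation) for word in words).most_common(3)
    if w not in exclusion_list
]

print(before)  # three entries
print(after)   # only one entry, because "a" and "b" used two of the slots
```

Filtering before counting is usually what is wanted when the goal is "the ten most frequent words that are not excluded".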
Answer 1 (score: 0)
You can use a list comprehension:
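>>> from collections import Counter
>>> words = ('proper prefix '+'1 2 3 4 5 6 7 8 9 A '*10+' proper suffix').split()
>>> exclusion_list = '1 3 5 7 9'.split()
>>> [w for w, c in Counter(words).most_common(10) if w not in exclusion_list]
['2', '4', '6', '8', 'A']
(Output is from Python 3.7+, where words with equal counts come out in first-seen order; older versions may order ties differently.)
If you want tuples pairing each word with its count:
>>> [(w, c) for w, c in Counter(words).most_common(10) if w not in exclusion_list]
[('2', 10), ('4', 10), ('6', 10), ('8', 10), ('A', 10)]
Another way, using filter (in Python 3, filter returns an iterator, so wrap it in list()):
>>> list(filter(lambda wc: wc[0] not in exclusion_list, Counter(words).most_common(10)))
[('2', 10), ('4', 10), ('6', 10), ('8', 10), ('A', 10)]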
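A variant closer to what the question was reaching for with .subtract is to delete the unwanted keys from the Counter before asking for the most common entries. This is only a sketch under the same sample data as above; the names `counts` and `top` are made up here. Note that Counter.subtract would not help: it lowers counts by the number of occurrences in its argument rather than removing words, and it returns None, so it cannot be chained.

```python
from collections import Counter

words = ('proper prefix ' + '1 2 3 4 5 6 7 8 9 A ' * 10 + ' proper suffix').split()
exclusion_list = '1 3 5 7 9'.split()

counts = Counter(words)
for word in exclusion_list:
    # Unlike a plain dict, deleting a missing key from a Counter
    # does not raise KeyError, so no membership check is needed.
    del counts[word]

# The excluded words are gone, so they cannot occupy any of the top slots.
top = counts.most_common(5)
print(top)
```

Because the excluded words are removed before most_common is called, this returns a full top N whenever enough other words exist, unlike filtering the top 10 afterwards.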