Question

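After running a few lines of code, ending with vocabulary on the last line, I got an output. It gave me 46,132 distinct words and told me how many times each word appears in the document.

I've attached a screenshot of the output below. I'm not sure what format vocabulary is in. I need to extract the 10 most frequent and the 10 least frequent words in the document. I'm not sure how to do that, probably because I don't know whether the output is a str or a tuple.

Can I just use max(vocabulary) to get the most frequent word in the document? And can I call sorted(vocabulary) and take the first 10 and the last 10 as the 10 most frequent and 10 least frequent words in the document?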

Answer 1

Getting the k most common words is easy with the collections.Counter class:

>>> vocabulary = { 'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2 }
>>> from collections import Counter
>>> vocabulary = Counter(vocabulary)
>>> vocabulary.most_common(2)
[('apple', 7), ('dog', 6)]

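Getting the least common words is a bit trickier. The simplest way is probably to sort the dictionary's key/value pairs by value and then take a slice: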

>>> sorted(vocabulary.items(), key=lambda x: x[1])[:2]
[('ball', 1), ('elf', 2)]
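
As an alternative that stays with Counter, the sketch below takes a reversed slice of the full most_common() list. It assumes vocabulary maps each word to its count, as in the toy example above; the variable names n and least_common are made up for illustration.

```python
from collections import Counter

# Toy data standing in for the real 46,132-word vocabulary (assumption).
vocabulary = Counter({'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2})

n = 2
# most_common() with no argument returns every (word, count) pair,
# highest count first; the reversed slice walks back from the end and
# yields the n lowest counts, smallest first.
least_common = vocabulary.most_common()[:-n - 1:-1]
print(least_common)  # [('ball', 1), ('elf', 2)]
```

This still sorts the whole vocabulary internally, so it is no faster than the explicit sorted() call; it only saves writing the key function.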

Since you need both, you might as well sort once and take two slices; that way you don't need Counter at all:

>>> sorted_vocabulary = sorted(vocabulary.items(), key=lambda x: x[1])
>>> most_common = sorted_vocabulary[-2:][::-1]
>>> least_common = sorted_vocabulary[:2]
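
On the max(vocabulary) idea from the question: if vocabulary is a dict (an assumption, since the screenshot is not available here), max and sorted compare the keys, that is the words in alphabetical order, not the counts. The sketch below shows the difference and a variant that uses heapq instead of a full sort. The helper name top_and_bottom is hypothetical, and ties between equal counts are resolved in whatever order heapq returns them.

```python
import heapq

def top_and_bottom(vocabulary, n=10):
    """Return (most_frequent, least_frequent) lists of (word, count) pairs.

    Assumes vocabulary is a dict-like mapping of word -> count.
    """
    items = vocabulary.items()
    by_count = lambda pair: pair[1]  # compare on the count, not the word
    most = heapq.nlargest(n, items, key=by_count)
    least = heapq.nsmallest(n, items, key=by_count)
    return most, least

# Toy data standing in for the real vocabulary (assumption).
vocabulary = {'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2}

most, least = top_and_bottom(vocabulary, n=2)
print(most)   # [('apple', 7), ('dog', 6)]
print(least)  # [('ball', 1), ('elf', 2)]

# Without a key, max() looks only at the words themselves.
print(max(vocabulary))                      # elf  (alphabetically last)
print(max(vocabulary, key=vocabulary.get))  # apple (highest count)
```

For the real data you would call top_and_bottom(vocabulary) with the default n=10.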

How to extract the 10 most frequent and 10 least frequent words in Python?
