Question

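I have almost everything worked out, but because I want the top 2k unique words, I end up with a super messy distribution plot. I will eventually use this to build a dictionary, but I want to see which 2k words are the most common so I can pick the relevant ones for the dictionary. Anyway, see the code below. How can I modify it so that the output looks like (word) (count)? It doesn't have to be limited to 2k; I'd be happy to see the counts for everything. Thanks!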

>>> import codecs
>>> import nltk
>>> from nltk.corpus import stopwords
>>> fileObj = codecs.open("/Users/shannonmcgregor/Desktop/ALLstories.txt", "r", "Latin-1")
>>> chattanooga_stories = fileObj.read()
>>> lowered_stories = chattanooga_stories.lower()
>>> word_list = lowered_stories.split()
>>> filtered_stories = [w for w in word_list if not w in stopwords.words('english')]
>>> fdist = nltk.FreqDist(w.lower() for w in filtered_stories)
>>> print(fdist)
<FreqDist with 7031 samples and 19893 outcomes>
>>> top_2k = [ ]
>>> top_2k = fdist.most_common(2000)
>>> fdist.plot(2000, cumulative=True)

Answer 1

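most_common() already gives you the count for each word: it returns a list of (word, count) tuples sorted from most to least frequent. Loop over that list to print each word with its count. Note that the result is a plain list, so it has no items() method. In NLTK 3, FreqDist is a Counter subclass, so fdist.items() is not guaranteed to be sorted either; most_common() is the way to get sorted output.

fdist = nltk.FreqDist(filtered_stories)    # filtered_stories is already lowercase
print(fdist)
top_2k = fdist.most_common(2000)           # list of (word, count) tuples, most frequent first
for word, count in top_2k:                 # use fdist.most_common() with no argument to get every word
    print(word, count)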
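
For anyone who wants to try the idea without the original file, here is a minimal sketch using collections.Counter from the standard library. Since NLTK 3's FreqDist subclasses Counter, the same most_common() call should behave the same way on a FreqDist. The sample text, the tiny stop-word set, and the helper name format_counts are all made up for illustration; they are not part of the question's data or of NLTK.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of the stories file.
text = "the river rose and the river fell and the town watched"

# Hypothetical tiny stop-word set; with NLTK you would build it once with
# set(stopwords.words('english')) instead of calling it for every word.
stop_words = {"the", "and"}

# Lowercase, split on whitespace, and drop stop words.
words = [w for w in text.lower().split() if w not in stop_words]

# Counter is the base class of nltk.FreqDist in NLTK 3.
counts = Counter(words)


def format_counts(counter, n=None):
    """Return 'word count' strings, most frequent first (n=None means all words)."""
    return [f"{word} {count}" for word, count in counter.most_common(n)]


# Print every word with its count, one pair per line.
for line in format_counts(counts):
    print(line)
```

Passing a number, e.g. format_counts(counts, 2000), limits the output to the top 2k words, and leaving it out prints the full list.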
