Printing a term frequency list (already have the distribution)

Time: 2016-06-21 02:06:44

Tags: python nltk

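I have almost everything sorted out, but because I want the top 2k unique words, the distribution plot I get is extremely cluttered. I will eventually use this to build a dictionary, but first I would like to see which 2k words are the most common so that I can pick the relevant ones for the dictionary. The code is below. How can I modify it so that I see the output as (word) (count)? It does not have to be limited to 2k; I would be happy to see the full list. Thanks!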

>>> import codecs
>>> import nltk
>>> from nltk.corpus import stopwords
>>> fileObj = codecs.open("/Users/shannonmcgregor/Desktop/ALLstories.txt", "r", "Latin-1")
>>> chattanooga_stories = fileObj.read()
>>> lowered_stories = chattanooga_stories.lower()
>>> word_list = lowered_stories.split()
>>> filtered_stories = [w for w in word_list if w not in stopwords.words('english')]
>>> fdist = nltk.FreqDist(w.lower() for w in filtered_stories)
>>> print(fdist)
<FreqDist with 7031 samples and 19893 outcomes>
>>> top_2k = []
>>> top_2k = fdist.most_common(2000)
>>> fdist.plot(2000, cumulative=True)

1 answer:

Answer 0: (score: 1)

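most_common() already gives you the count for each word: it returns a list of (word, count) tuples sorted from most to least frequent. That list has no items() method, so iterate over it directly to print each word with its count. Calling most_common() with no argument returns every word rather than only the top 2000.

fdist = nltk.FreqDist(filtered_stories)    # filtered_stories is already lowercase
print(fdist)
top_2k = fdist.most_common(2000)           # list of (word, count) tuples, most frequent first
for word, count in top_2k:                 # prints one "word count" pair per line
    print(word, count)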
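
To try the (word) (count) output without the original story file, here is a minimal sketch. It assumes NLTK 3, where FreqDist subclasses collections.Counter, and so it uses Counter directly as a stand-in. The sample text, the two-word stopword set, and the helper name format_counts are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical stand-ins for the question's data: a short text and a tiny stopword set.
sample_text = "The river rose and the river fell and the town watched the river"
stop_words = {"the", "and"}

# Lowercase, split on whitespace, and drop stopwords, as in the question.
words = [w for w in sample_text.lower().split() if w not in stop_words]

# Counter provides the same most_common() call used on the FreqDist above.
counts = Counter(words)


def format_counts(counter, n=None):
    """Return 'word count' strings, most frequent first; n=None means every word."""
    return ["{} {}".format(word, count) for word, count in counter.most_common(n)]


lines = format_counts(counts)
print("\n".join(lines))
```

Passing n=2000 to the helper would limit the output to the top 2000 words, while leaving n at None lists the whole vocabulary.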