Question

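I have used FreqDist to show the count of each bigram that occurs in a file; the output is a list of tuples, each followed by its count. How can I use a for/while loop to get the bigrams with the highest counts?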

import re
import nltk

raw=open("ex.txt","r").read()
tokens=nltk.word_tokenize(raw)
words=re.compile('.*[A-Za-z0-9].*')
filtered=[w for w in tokens if words.match(w)]
pairs=nltk.bigrams(filtered)
fdist=nltk.FreqDist(pairs)
type(fdist)

for w1,w2 in fdist.items():
   print w1,w2

Output:

('有', ''') 6 ('有', '做') 8 ('in', 'the') 2 ...

How can I extract the bigrams with counts 6 and 8?

Answer 1

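FreqDist is basically a dictionary with some fancy wrapping, which includes returning its keys in sorted order, most frequent first (see the docs). Note that this ordering holds for NLTK 2; in NLTK 3, FreqDist subclasses collections.Counter, keys() is no longer sorted, and fdist.most_common(2) gives the top two entries instead.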

fdist.keys()[:2]

If you want to extract all keys whose value is greater than, e.g., 4, use filter:

filter(lambda x: fdist[x] > 4, fdist)
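
For reference, here is a minimal Python 3 sketch of the same two ideas. It uses collections.Counter with made-up counts as a stand-in for the asker's fdist (the bigrams and numbers below are hypothetical), on the assumption that an NLTK 3 FreqDist, being a Counter subclass, accepts the same calls.

```python
from collections import Counter

# Hypothetical counts standing in for the asker's fdist. In NLTK 3,
# FreqDist subclasses Counter, so the same calls are assumed to work on it.
fdist = Counter({
    ("have", "done"): 8,
    ("have", "been"): 6,
    ("in", "the"): 2,
    ("of", "the"): 5,
})

# The two most frequent bigrams, as (bigram, count) pairs.
top_two = fdist.most_common(2)

# All bigrams whose count is greater than 4.
frequent = [bigram for bigram, count in fdist.items() if count > 4]

print(top_two)
print(frequent)
```

A list comprehension is used instead of filter here because filter returns a lazy iterator in Python 3, which prints less helpfully.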

Answer 2

>>> text = """This is a foo bar\nsomething something foo foo bar, that doesn't do nothing!\n"""
>>> from nltk.util import bigrams
>>> from nltk.probability import FreqDist
>>> from nltk.tokenize import word_tokenize
>>> FreqDist(bigrams(word_tokenize(text)))
<FreqDist with 15 samples and 16 outcomes>

>>> x = FreqDist(bigrams(word_tokenize(text)))
>>> for i in x:
...     print i, x[i]
... 
('foo', 'bar') 2
(',', 'that') 1
('This', 'is') 1
('a', 'foo') 1
('bar', ',') 1
('bar', 'something') 1
('do', 'nothing') 1
('does', "n't") 1
('foo', 'foo') 1
('is', 'a') 1
("n't", 'do') 1
('nothing', '!') 1
('something', 'foo') 1
('something', 'something') 1
('that', 'does') 1
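
Since the question asks for a plain loop, the following is one possible sketch of picking out the highest-count bigram by hand. To stay runnable without NLTK or its tokenizer data, it hard-codes the tokens from the session above, builds the pairs with zip, and uses collections.Counter in place of FreqDist; those substitutions are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter

# Tokens as word_tokenize split the sample text in the session above,
# hard-coded so the sketch runs without NLTK installed.
tokens = ["This", "is", "a", "foo", "bar", "something", "something",
          "foo", "foo", "bar", ",", "that", "does", "n't", "do",
          "nothing", "!"]

# Pair each token with its successor, mirroring what nltk.bigrams yields.
pairs = zip(tokens, tokens[1:])
counts = Counter(pairs)  # stand-in for nltk.FreqDist(pairs)

# Walk over every (bigram, count) entry and remember the largest count seen.
best_pair, best_count = None, 0
for pair, count in counts.items():
    if count > best_count:
        best_pair, best_count = pair, count

print(best_pair, best_count)  # ('foo', 'bar') 2
```

With several bigrams tied for the top count, this loop keeps the first one it meets; collecting every pair whose count equals best_count in a second pass would return all of them.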

Counting word pairs in NLTK using fdist
