Counting NLTK word pairs with fdist

Time: 2014-01-25 13:19:30

Tags: python nltk

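I have used FreqDist to count how often each bigram occurs in a file; the output is a list of tuples, each followed by its count. How can I use a for/while loop to get the bigrams with the highest counts?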

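import re
import nltk

# Read the whole file and split it into tokens.
raw = open("ex.txt", "r").read()
tokens = nltk.word_tokenize(raw)
# Keep only tokens that contain at least one letter or digit.
words = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in tokens if words.match(w)]
pairs = nltk.bigrams(filtered)
fdist = nltk.FreqDist(pairs)
type(fdist)

# Each item is a (bigram, count) pair.
for w1, w2 in fdist.items():
    print w1, w2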

Output:

('have',''') 6 ('have','do') 8 ('in','the') 2 ...... .....

How can I extract the bigrams with counts 6 and 8?

2 answers:

Answer 0: (score: 1)

FreqDist is basically a dictionary with some fancy wrapping, including returning its keys in sorted order (see the docs).

fdist.keys()[:2]
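
The sorted-keys behaviour applies to older NLTK releases. In NLTK 3, FreqDist subclasses collections.Counter, so keys() is no longer sorted by frequency and cannot be sliced under Python 3; most_common(n) is the usual replacement. A minimal sketch of that approach follows. It uses a plain Counter as a stand-in for FreqDist so that it runs without NLTK installed, and the token list is made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for the filtered tokens in the question.
tokens = ["have", "do", "have", "do", "in", "the", "have", "do"]

# Adjacent pairs, similar in spirit to nltk.bigrams(tokens).
pairs = list(zip(tokens, tokens[1:]))

# Counter stands in for nltk.FreqDist here (an assumption made so the
# sketch runs without NLTK installed).
fdist = Counter(pairs)

# The two most frequent bigrams with their counts, highest count first.
top_two = fdist.most_common(2)
print(top_two)
```

With a real nltk.FreqDist the same most_common(2) call should apply, since it is inherited from Counter.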

If you want to extract all keys whose value is greater than, say, 4, use filter:

filter(lambda x: fdist[x] > 4, fdist)
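
Under Python 3, filter returns a lazy iterator rather than a list, so a comprehension is often easier to read and also keeps the counts next to the keys. The sketch below assumes a threshold of 4 as in the answer; the bigrams and counts are hypothetical, and Counter again stands in for FreqDist.

```python
from collections import Counter

# Hypothetical bigram counts standing in for an nltk.FreqDist.
fdist = Counter({("of", "the"): 6, ("to", "be"): 8, ("in", "the"): 2})

threshold = 4

# Keep every bigram whose count is strictly greater than the threshold.
frequent = {bigram: count for bigram, count in fdist.items() if count > threshold}
print(frequent)
```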

Answer 1: (score: 0)

>>> text = """This is a foo bar\nsomething something foo foo bar, that doesn't do nothing!\n"""
>>> from nltk.util import bigrams
>>> from nltk.probability import FreqDist
>>> from nltk.tokenize import word_tokenize
>>> x = FreqDist(bigrams(word_tokenize(text)))
>>> x
<FreqDist with 15 samples and 16 outcomes>

>>> for i in x:
...     print i, x[i]
... 
('foo', 'bar') 2
(',', 'that') 1
('This', 'is') 1
('a', 'foo') 1
('bar', ',') 1
('bar', 'something') 1
('do', 'nothing') 1
('does', "n't") 1
('foo', 'foo') 1
('is', 'a') 1
("n't", 'do') 1
('nothing', '!') 1
('something', 'foo') 1
('something', 'something') 1
('that', 'does') 1
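
Since the question asks for an explicit loop, here is one possible sketch that walks the (bigram, count) pairs and keeps whichever bigrams share the highest count. The helper name highest_count_bigrams is hypothetical, Counter stands in for FreqDist, and str.split is used as a simplified tokenizer instead of word_tokenize, so the tokens differ slightly from the session above.

```python
from collections import Counter


def highest_count_bigrams(fdist):
    """Return (bigrams, count) for the bigram(s) with the largest count."""
    best, best_count = [], 0
    for bigram, count in fdist.items():
        if count > best_count:
            # Strictly higher count: start a new list of winners.
            best, best_count = [bigram], count
        elif count == best_count:
            # Tie with the current maximum: keep this bigram as well.
            best.append(bigram)
    return best, best_count


# Whitespace tokenization is an assumption made to avoid needing NLTK.
tokens = "This is a foo bar something something foo foo bar".split()
fdist = Counter(zip(tokens, tokens[1:]))

best, best_count = highest_count_bigrams(fdist)
print(best, best_count)
```

Returning a list rather than a single bigram is a design choice so that ties for the highest count are not silently dropped.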