I am using this code to get the frequency of bigrams:
import nltk
from collections import defaultdict

text1 = 'the cat jumped over the dog in the dog house'
text = text1.split()
counts = defaultdict(int)
# Count each adjacent word pair (bigram)
for pair in nltk.bigrams(text):
    counts[pair] += 1
for pair, c in counts.iteritems():
    print pair, c
The output is:
('the', 'cat') 1
('dog', 'in') 1
('cat', 'jumped') 1
('jumped', 'over') 1
('in', 'the') 1
('over', 'the') 1
('dog', 'house') 1
('the', 'dog') 2
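What I need is the same list of bigrams, but with each word replaced by its rank. By "rank" I mean that the most frequent word has rank 1, the second most frequent has rank 2, and so on. Words with equal frequency are ranked in the order in which they first appear in the text. Here the ranks are: 1. the, 2. dog, 3. cat, 4. jumped, 5. over, etc.
For example
1 3 1
instead of
('the', 'cat') 1
I believe that to do this I need a dictionary mapping words to ranks, but I am stuck and do not know how to proceed. What I have is: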
from nltk import FreqDist

fd = FreqDist()
ranks = []
rank = 0
for word in text:
    fd.inc(word)
for rank, word in enumerate(fd):
    ranks.append(rank + 1)
word_rank = {}
for word in text:
    word_rank[word] = ranks
print ranks
Answer 0 (score: 3)
Assuming counts has already been created, the following should give you the result you want:
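from collections import defaultdict

# Word frequencies
freq = defaultdict(int)
for word in text:
    freq[word] += 1
# Most frequent first; ties broken by first position in the text
ranks = sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
# Map each word to its 1-based rank
ranks = dict(zip(ranks, range(1, len(ranks)+1)))
for (a, b), count in counts.iteritems():
    print ranks[a], ranks[b], count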
Output:
1 3 1
2 6 1
3 4 1
4 5 1
6 1 1
5 1 1
2 7 1
1 2 2
Here are some intermediate values that may help in understanding how this works:
>>> dict(freq)
{'house': 1, 'jumped': 1, 'over': 1, 'dog': 2, 'cat': 1, 'in': 1, 'the': 3}
>>> sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
['the', 'dog', 'cat', 'jumped', 'over', 'in', 'house']
>>> dict(zip(ranks, range(1, len(ranks)+1)))
{'house': 7, 'jumped': 4, 'over': 5, 'dog': 2, 'cat': 3, 'in': 6, 'the': 1}
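For readers on Python 3, the same steps might look like the sketch below. It is an adaptation rather than the answer's exact code: it assumes Python 3.7+, uses collections.Counter for both counts, and uses zip(text, text[1:]) as a stand-in for nltk.bigrams so that it runs without NLTK installed.

```python
from collections import Counter

text = 'the cat jumped over the dog in the dog house'.split()

# Count adjacent word pairs; zip(text, text[1:]) stands in for nltk.bigrams(text).
counts = Counter(zip(text, text[1:]))

# Word frequencies.
freq = Counter(text)

# Most frequent first; ties broken by first position in the text.
ordered = sorted(freq, key=lambda w: (-freq[w], text.index(w)))
ranks = {word: rank for rank, word in enumerate(ordered, start=1)}

# One (rank_a, rank_b, count) row per distinct bigram.
rows = [(ranks[a], ranks[b], n) for (a, b), n in counts.items()]
for row in rows:
    print(*row)
```

Because Python 3.7+ dictionaries keep insertion order, the rows come out in the order the bigrams first occur in the text, so the line order differs from the Python 2 output above while the set of rows is the same.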
Answer 1 (score: 0)
from collections import Counter

text1 = 'the cat jumped over the dog in the dog house'.split(' ')
# Position (1-based) of each word's first appearance
word_to_rank = {}
for i, word in enumerate(text1):
    if word not in word_to_rank:
        word_to_rank[word] = i + 1
word_to_frequency = Counter(text1)
# Sort key per word: higher frequency first, then earlier first appearance
word_to_tuple = {}
for word in word_to_rank:
    word_to_tuple[word] = (-word_to_frequency[word], word_to_rank[word])
tuple_to_word = dict(zip(word_to_tuple.values(), word_to_tuple.keys()))
sorted_by_conditions = sorted(tuple_to_word.keys())
word_to_true_rank = {}
for i, _tuple in enumerate(sorted_by_conditions):
    word_to_true_rank[tuple_to_word[_tuple]] = i + 1

def fix(pair, c):
    return word_to_true_rank[pair[0]], word_to_true_rank[pair[1]], c

pair = ('the', 'cat')
c = 1
print fix(pair, c)
pair = ('the', 'dog')
c = 2
print fix(pair, c)
Output:
(1, 3, 1)
(1, 2, 2)
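The same ranking can probably be expressed more compactly by sorting the words directly on the (-frequency, first position) tuple instead of going through the intermediate dictionaries. The following is a sketch under that assumption, written for Python 3; rank_by_frequency is a hypothetical helper name, and fix takes the rank dictionary as an extra argument here, which differs from the version above.

```python
from collections import Counter


def rank_by_frequency(words):
    """Map each word to its rank: 1 = most frequent, ties by first appearance."""
    freq = Counter(words)
    first_seen = {}
    for i, word in enumerate(words):
        first_seen.setdefault(word, i)  # keeps only the first index of each word
    ordered = sorted(freq, key=lambda w: (-freq[w], first_seen[w]))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def fix(pair, c, word_to_rank):
    """Replace both words of a bigram with their ranks, keeping the count."""
    return word_to_rank[pair[0]], word_to_rank[pair[1]], c


words = 'the cat jumped over the dog in the dog house'.split()
word_to_true_rank = rank_by_frequency(words)
print(fix(('the', 'cat'), 1, word_to_true_rank))  # (1, 3, 1)
print(fix(('the', 'dog'), 2, word_to_true_rank))  # (1, 2, 2)
```

Recording the first position of each word in a single pass avoids calling text.index once per word during the sort, which may matter for longer texts.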