I'm looking for a way to keep only trigram frequencies in memory and compute the unigram and bigram frequencies on the fly, as follows:
Given a trigram u, v, w:
count(v, w) = sum(., v, w), i.e. the sum over all u
Similarly, count(w) = sum(., w)
This will certainly lose some unigrams, for example the sentence-start tokens, but does it sound like a valid way to derive unigrams and bigrams?
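To make the concern about lost counts concrete, here is a minimal sketch. The toy sentence and the helper name `marginalize_last` are illustrative assumptions, not part of the question. With no padding at all, summing over the leading positions only recovers words that end some trigram, so the first two tokens of a sentence (and its first bigram) drop out.

```python
from collections import Counter

# Hypothetical toy data: a single sentence with no start/end padding.
sentence = ['the', 'dog', 'walks', 'home']

# Trigram counts are the only table kept in memory.
trigram_counts = Counter(zip(sentence, sentence[1:], sentence[2:]))

def marginalize_last(trigram_counts, n):
    """Sum trigram counts over the leading positions, keeping the last n words."""
    totals = Counter()
    for trigram, count in trigram_counts.items():
        totals[trigram[3 - n:]] += count
    return totals

derived_unigrams = marginalize_last(trigram_counts, 1)
derived_bigrams = marginalize_last(trigram_counts, 2)

# Counts taken directly from the sentence, for comparison.
true_unigrams = Counter((w,) for w in sentence)
true_bigrams = Counter(zip(sentence, sentence[1:]))

# Entries present in the direct counts but absent from the derived ones.
missing_unigrams = sorted(set(true_unigrams) - set(derived_unigrams))
missing_bigrams = sorted(set(true_bigrams) - set(derived_bigrams))
print(missing_unigrams)  # [('dog',), ('the',)]
print(missing_bigrams)   # [('the', 'dog')]
```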
Answer 0 (score: 3)
Yes, that works. You can check it by building a small corpus yourself and comparing the derived counts against counts taken directly.
from collections import Counter

corpus = [['the','dog','walks'], ['the','dog','runs'], ['the','cat','runs']]
# Pad with two start tokens so that every real word ends some trigram, plus one end token.
corpus_with_ends = [['<s>','<s>'] + s + ['<e>'] for s in corpus]
trigram_counts = Counter(trigram for s in corpus_with_ends for trigram in zip(s,s[1:],s[2:]))
# Bigram counts: sum the trigram counts over the first word.
unique_bigrams = set((b,c) for a,b,c in trigram_counts)
bigram_counts = dict((bigram,sum(count for trigram,count in trigram_counts.items() if trigram[1:] == bigram)) for bigram in unique_bigrams)
# Unigram counts: sum over the first two words, skipping the end token.
unique_unigrams = set((c,) for a,b,c in trigram_counts if c != '<e>')
unigram_counts = dict((unigram,sum(count for trigram,count in trigram_counts.items() if trigram[2:] == unigram)) for unigram in unique_unigrams)
Now you can check:
>>> true_bigrams = [bigram for s in corpus_with_ends for bigram in zip(s[1:],s[2:])]
>>> true_bigram_counts = Counter(true_bigrams)
>>> bigram_counts == true_bigram_counts
True
>>> true_unigrams = [(unigram,) for s in corpus_with_ends for unigram in s[2:-1]]
>>> true_unigram_counts = Counter(true_unigrams)
>>> unigram_counts == true_unigram_counts
True
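The comprehensions in the answer rescan every trigram once per bigram and once per unigram, which is fine for a toy corpus but grows quadratically. As a hedged alternative, the same marginal counts can be accumulated in one pass over the trigram table. The function name `derive_counts` and its `end_token` parameter are assumptions made for this sketch, not part of the original answer.

```python
from collections import Counter

corpus = [['the', 'dog', 'walks'], ['the', 'dog', 'runs'], ['the', 'cat', 'runs']]
# Same padding as in the answer: two start tokens and one end token.
padded = [['<s>', '<s>'] + s + ['<e>'] for s in corpus]
trigram_counts = Counter(t for s in padded for t in zip(s, s[1:], s[2:]))

def derive_counts(trigram_counts, end_token='<e>'):
    """Accumulate bigram and unigram counts in a single pass over the trigrams."""
    bigram_counts = Counter()
    unigram_counts = Counter()
    for (u, v, w), count in trigram_counts.items():
        # count(v, w) is the sum over all u of count(u, v, w).
        bigram_counts[(v, w)] += count
        # count(w) is the sum over all u, v; the end token is not a real word.
        if w != end_token:
            unigram_counts[(w,)] += count
    return bigram_counts, unigram_counts

bigram_counts, unigram_counts = derive_counts(trigram_counts)
print(bigram_counts[('the', 'dog')])  # 2
print(unigram_counts[('the',)])       # 3
```

Because the keys have the same tuple shape as in the answer, the equality checks against `true_bigram_counts` and `true_unigram_counts` should apply to these results unchanged.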