What is the probability of 'beginning' given 'the'?

Time: 2014-05-07 23:19:38

Tags: python nltk corpus tagged-corpus

Using an NLTK Conditional Frequency Distribution and the nltk.bigrams function, train a bigram model on the Genesis corpus:

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
Answer the following questions

What is the probability of ‘beginning’ given ‘the’?
What is the probability of ‘the’?

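Note: the probabilities you give as answers must be probabilities that can be computed from this corpus.

Hi, can someone help me? This is from the NLTK book. When I work it out I get 78%, which doesn't make sense. I am trying to compute it in Python.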

1 answer:

Answer 0: (score: 0)

There is a difference between the probability of 'beginning' intersect 'the':

p('beginning','the')

and the probability of 'beginning' given 'the':

p('beginning'|'the') = p('beginning','the') / p('the')
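
In terms of corpus counts, the conditional form reduces to a ratio of two counts. The derivation below is a sketch that assumes "given 'the'" means that 'the' is the preceding word, which is how a ConditionalFreqDist built from bigrams is keyed. The symbols are introduced here for illustration: C(the, beginning) is the number of bigrams 'the beginning', C(the, ·) is the number of bigrams whose first word is 'the', and N is the total number of bigrams.

```latex
P(w_2 = \text{beginning} \mid w_1 = \text{the})
  = \frac{P(w_1 = \text{the},\ w_2 = \text{beginning})}{P(w_1 = \text{the})}
  = \frac{C(\text{the},\ \text{beginning}) / N}{C(\text{the},\ \cdot) / N}
  = \frac{C(\text{the},\ \text{beginning})}{C(\text{the},\ \cdot)}
```

Because N cancels, the conditional probability can be read straight from the counts. The joint probability keeps the 1/N factor, so it is usually much smaller.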

Try:

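from __future__ import print_function  # lets the print() calls below run on Python 2 and 3

from collections import Counter

import nltk

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd_bigrams = Counter(bigrams)      # counts of (first word, second word) pairs
cfd_unigrams = Counter(list(text))  # counts of single words

# Joint probability: share of all bigrams that are ('said', 'unto').
print("p('said','unto') =", cfd_bigrams[u'said', u'unto'] / float(sum(cfd_bigrams.values())))

# Joint probability divided by the raw count of 'unto' (a count, not p('unto')).
print("p('said'|'unto') =", (cfd_bigrams[u'said', u'unto'] / float(sum(cfd_bigrams.values()))) / cfd_unigrams[u'unto'])

# Raw count of the bigram ('beginning', 'the'), i.e. 'beginning' followed by 'the'.
print("p('beginning','the') =", cfd_bigrams[u'beginning', u'the'])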

[OUT]:

p('said','unto') = 0.00397649844738
p('said'|'unto') = 6.73982787691e-06
p('beginning','the') = 0
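
The zero in the last line comes from looking up 'beginning' followed by 'the'; the phrase 'the beginning' is stored as the pair ('the', 'beginning'). Likewise, a conditional probability needs the joint count divided by the count of bigrams that start with the condition word, not the joint probability divided by a raw count. The following is a minimal sketch under those assumptions. The helper names `bigram_stats` and `genesis_conditional` are hypothetical, the toy sentence is made up for illustration, and the code uses Python 3 division.

```python
from collections import Counter


def bigram_stats(tokens, first, second):
    """Return (joint, conditional, unigram) estimates for the bigram (first, second).

    Assumes `tokens` holds at least two items.
    """
    tokens = list(tokens)
    unigrams = Counter(tokens)
    # Pair each token with its successor: (w[i], w[i+1]).
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = sum(bigrams.values())

    # Joint: share of all bigrams that are exactly (first, second).
    joint = bigrams[first, second] / n_bigrams
    # Conditional: among bigrams starting with `first`, the share followed by `second`.
    starts_with_first = sum(c for (w1, _), c in bigrams.items() if w1 == first)
    conditional = bigrams[first, second] / starts_with_first if starts_with_first else 0.0
    # Unigram: share of all tokens equal to `first`.
    unigram = unigrams[first] / len(tokens)
    return joint, conditional, unigram


def genesis_conditional(given='the', word='beginning'):
    """Same conditional estimate via NLTK; needs NLTK and its 'genesis' corpus data."""
    import nltk  # imported lazily so the toy example below runs without NLTK

    text = nltk.corpus.genesis.words('english-kjv.txt')
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(text))
    # cfd[given] is a FreqDist over the words that follow `given`.
    return cfd[given].freq(word)


# Toy corpus (made up): 8 tokens, 7 bigrams, 'the' occurs 3 times.
toy = "in the beginning the earth and the heaven".split()
joint, conditional, unigram = bigram_stats(toy, 'the', 'beginning')
print("joint =", joint, "conditional =", conditional, "p('the') =", unigram)
```

On the toy sentence the three values are 1/7, 1/3 and 3/8. Calling `genesis_conditional()` should give the corresponding conditional estimate for the Genesis text, provided the corpus has been downloaded.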