Estimating (unigram) perplexity with the NLTK package

Asked: 2015-10-21 18:48:31

Tags: python-2.7 nlp nltk n-gram language-model

I am trying to compute the perplexity of my data. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I get this error:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

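I have already run Latent Dirichlet Allocation on my data and generated the unigrams and their respective probabilities (normalized so that the probabilities over the data sum to 1).

My unigrams and their probabilities look like this: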

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a snippet of my unigrams file. It continues in the same format for about 1000 lines. The probabilities (second column) sum to 1.
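
A file in this two-column format can be read into the dictionary that the perplexity formula needs. The following is a minimal sketch, assuming each line holds one word and one probability separated by whitespace; the function name `load_unigrams` is hypothetical, and the sample lines are copied from the snippet above.

```python
def load_unigrams(lines):
    """Parse 'word probability' lines into a dict mapping word -> float."""
    unigram = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last run of whitespace: the probability is the final field
        word, prob = line.rsplit(None, 1)
        unigram[word] = float(prob)
    return unigram

# a few lines taken from the snippet in the question
sample = """Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
four 0.000207196011781"""

unigram = load_unigrams(sample.splitlines())
print(unigram["four"])
```

For the real data, the same function can be given an open file object instead of `sample.splitlines()`, since iterating over a file also yields one line at a time.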

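I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused about how to fix the problem. The sample code I have here comes from the nltk documentation, and I don't know what to do now. Please tell me what I can do. Thanks in advance!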

2 answers:

Answer 0 (score: 13)

Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

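$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$

where $W = w_1 w_2 \dots w_N$ is the test set and $N$ is the number of words in it.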
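
As a quick worked instance of this definition (the numbers are made up for illustration), take a three-word test set whose words have unigram probabilities 0.1, 0.2 and 0.05:

$$PP = \left(\frac{1}{0.1 \times 0.2 \times 0.05}\right)^{1/3} = \left(\frac{1}{0.001}\right)^{1/3} = 1000^{1/3} = 10$$

A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 words at each position.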

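You say you have already built the unigram model, meaning that for each word you have its associated probability. Then you only need to apply the formula. I assume you have a dictionary unigram[word] that gives the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you used so I can adapt my solution accordingly.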

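perplexity = 1
N = 0

# multiply the inverse probabilities of the test words the model knows about
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
# take the N-th root to normalize by the number of words counted
perplexity = pow(perplexity, 1/float(N))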
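
With probabilities as small as the ones in the question, the running product of inverse probabilities can overflow to infinity on a long test set. A common workaround is to sum log-probabilities and exponentiate at the end, which is mathematically the same quantity. The following is a minimal sketch of that variant, not part of the original answer; the function name `perplexity_log_space` and the toy model are assumptions for illustration.

```python
import math

def perplexity_log_space(testset, unigram):
    """Perplexity of a unigram model over a list of words, computed in log space."""
    log_prob_sum = 0.0
    n = 0
    for word in testset:
        if word in unigram:  # like the loop above, words missing from the model are skipped
            log_prob_sum += math.log(unigram[word])
            n += 1
    if n == 0:
        # without this guard the division below would fail with ZeroDivisionError
        raise ValueError("no test word is covered by the model")
    # exp(-(1/N) * sum(log p)) equals the N-th root of the product of 1/p
    return math.exp(-log_prob_sum / n)

# toy model, made up for illustration
toy_unigram = {"a": 0.1, "b": 0.2, "c": 0.05}
print(perplexity_log_space(["a", "b", "c", "zzz"], toy_unigram))
```

For this toy input the result is approximately 10, because "zzz" is skipped and the three remaining probabilities multiply to 0.001.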

Update:

Since you asked for a complete working example, here is a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here is how we first build the unigram model:

import collections, nltk
# first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# build the unigram language model
def unigram(tokens):
    # unseen words default to 0.01; note that every count also starts from this default
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        # a defaultdict never raises KeyError, so no try/except is needed
        model[f] += 1
    N = float(sum(model.values()))
    # turn the counts into relative frequencies
    for word in model:
        model[word] = model[word]/N
    return model

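Our model here is smoothed: for words outside its knowledge, it assigns a low fixed probability of 0.01. I have already shown you how to compute perplexity: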

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test it on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

You will get the following results:

>>> 
49.09452736318415
99.99999999999997

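Note that when working with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the corresponding perplexity is also smaller.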
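
The fixed 0.01 fallback is a shortcut: once it is added for unseen words, the probabilities no longer sum to 1. One alternative is additive (Lidstone) smoothing, the same family as the `LidstoneProbDist` estimator with 0.2 used in the question. The following is a minimal pure-Python sketch, not the method of this answer; the function names, the `<UNK>` token, and the choice to pool all unseen words into a single slot are assumptions for illustration.

```python
import collections

def lidstone_unigram(tokens, gamma=0.2, unk="<UNK>"):
    """Unigram model with additive smoothing; one extra slot covers all unseen words."""
    counts = collections.Counter(tokens)
    vocab_size = len(counts) + 1  # +1 reserves a slot for unseen words
    total = sum(counts.values()) + gamma * vocab_size
    # every seen word gets (count + gamma) / total
    model = dict((word, (count + gamma) / total) for word, count in counts.items())
    # the reserved slot gets gamma / total, so the whole model sums to 1
    model[unk] = gamma / total
    return model

def prob(model, word, unk="<UNK>"):
    """Look up a word, falling back to the unseen-word slot."""
    return model.get(word, model[unk])

smoothed = lidstone_unigram("the cat sat on the mat".split())
print(prob(smoothed, "the"))
print(prob(smoothed, "dog"))
```

Because unseen words are looked up through `prob` rather than inserted into the model, scoring a test set does not change the model itself.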

Answer 1 (score: -1)

Thanks for the code snippet! Shouldn't:

for word in model:
        model[word] = model[word]/float(sum(model.values()))

rather be:

v = float(sum(model.values()))
for word in model:
        model[word] = model[word]/v

Oh... I see this has already been answered...
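
The difference between the two loops in this answer is easy to check on a small example. The following sketch uses made-up counts and hypothetical function names: recomputing the sum inside the loop divides later words by a total that already contains normalized values, so the result no longer sums to 1.

```python
def normalize_recomputing_sum(counts):
    """Divide each count by a sum that is recomputed on every iteration (buggy)."""
    model = dict(counts)
    for word in model:
        # the sum shrinks as earlier entries are overwritten with probabilities
        model[word] = model[word] / float(sum(model.values()))
    return model

def normalize_hoisted_sum(counts):
    """Compute the total once, then divide every count by it (correct)."""
    model = dict(counts)
    total = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / total
    return model

# made-up counts for illustration
counts = {"a": 2.0, "b": 1.0, "c": 1.0}
print(sum(normalize_recomputing_sum(counts).values()))
print(sum(normalize_hoisted_sum(counts).values()))
```

Hoisting the sum out of the loop is also cheaper, since the total is computed once instead of once per word.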