Question

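To put my question in context: I would like to train and test/compare several (neural) language models. To focus on the models rather than on data preparation, I chose to use the Brown corpus from nltk and to train the Ngram model that ships with nltk as a baseline (to compare the other LMs against).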

So my first question is actually about a behaviour of nltk's Ngram model that I find suspicious. Since the code is rather short, I have pasted it here:

import nltk

print "... build"
brown = nltk.corpus.brown
corpus = [word.lower() for word in brown.words()]

# Train on 95% of the corpus and test on the rest
spl = 95*len(corpus)/100
train = corpus[:spl]
test = corpus[spl:]

# Remove rare words from the corpus
fdist = nltk.FreqDist(w for w in train)
vocabulary = set(map(lambda x: x[0], filter(lambda x: x[1] >= 5, fdist.iteritems())))

train = map(lambda x: x if x in vocabulary else "*unknown*", train)
test = map(lambda x: x if x in vocabulary else "*unknown*", test)

print "... train"
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) 
lm = NgramModel(5, train, estimator=estimator)

print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) )
print "perplexity(test) =", lm.perplexity(test)

What I find very suspicious is that I get the following results:

... build
... train
len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
perplexity(test) = 4.60298447026

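What puzzles me is that Ngram modeling appears to work extremely well on this corpus. If my interpretation is correct, the model should be able to guess the correct word in roughly 5 tries on average (even though there are 13817 possibilities...). Could you share your experience with perplexity values like this one (I don't really believe it)? I did not find any complaints about nltk's ngram model on the net (but maybe I am doing something wrong). Do you know of a good alternative to NLTK for Ngram models and computing perplexity?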
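To make the "roughly 5 tries" reading concrete, here is a minimal sketch of how perplexity relates to per-word probabilities. It is independent of nltk; the `perplexity` helper and the uniform "model" are hypothetical names introduced only for illustration.

```python
import math

def perplexity(probabilities):
    """Perplexity given the probability a model assigned to each test word."""
    # Cross-entropy in bits: average negative log2-probability per word.
    entropy = -sum(math.log(p, 2) for p in probabilities) / float(len(probabilities))
    return 2.0 ** entropy

vocabulary_size = 13817  # len(vocabulary) from the output above

# Hypothetical model that spreads probability evenly over the vocabulary.
uniform = [1.0 / vocabulary_size] * 1000
print(perplexity(uniform))
```

Under this definition a model that guesses uniformly has a perplexity equal to the vocabulary size, so a value near 4.6 would mean the model behaves as if it were choosing among fewer than 5 equally likely words at each position.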

Thanks!

Answer 1

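You are getting a low perplexity because you are using a pentagram (5-gram) model. If you used a bigram model, your results would fall in a more usual range of about 50-1000 (or about 5 to 10 bits).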
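The two ranges quoted above are the same quantity on different scales: the number of bits is the base-2 logarithm of the perplexity. A small sketch (the helper name `perplexity_to_bits` is made up for this example):

```python
import math

def perplexity_to_bits(perplexity):
    """Cross-entropy in bits per word that corresponds to a perplexity."""
    return math.log(perplexity, 2)

# 50 and 1000 are the range quoted above; 4.60298447026 is the value from the question.
for pp in (50, 1000, 4.60298447026):
    print("%s -> %.2f bits" % (pp, perplexity_to_bits(pp)))
```

This gives about 5.6 bits for a perplexity of 50, about 10 bits for 1000, and about 2.2 bits for the value reported in the question.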

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.
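While the model module is unavailable, a bigram baseline is small enough to write by hand. The following is only a sketch under stated assumptions, not NLTK's implementation: all function names are hypothetical, smoothing is additive (Lidstone) with gamma = 0.2 as in the question, and every test token is assumed to have already been mapped into the vocabulary (for example to "*unknown*").

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams in a list of tokens."""
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts

def lidstone_prob(word, context, context_counts, bigram_counts, vocab_size, gamma=0.2):
    """P(word | context) with additive smoothing over the whole vocabulary."""
    numerator = bigram_counts[(context, word)] + gamma
    denominator = context_counts[context] + gamma * vocab_size
    return numerator / denominator

def bigram_perplexity(tokens, context_counts, bigram_counts, vocab_size, gamma=0.2):
    """Perplexity of tokens[1:], each word predicted from the word before it."""
    pairs = list(zip(tokens[:-1], tokens[1:]))
    log_prob = sum(
        math.log(lidstone_prob(w, c, context_counts, bigram_counts, vocab_size, gamma), 2)
        for c, w in pairs
    )
    return 2.0 ** (-log_prob / len(pairs))

# Toy data, purely illustrative (not the Brown corpus).
train_tokens = "the cat sat on the mat and the cat ate".split()
test_tokens = "the cat sat on the mat".split()
vocab = set(train_tokens)

context_counts, bigram_counts = train_bigram(train_tokens)
print(bigram_perplexity(test_tokens, context_counts, bigram_counts, len(vocab)))
```

Because the denominator uses the full vocabulary size, the probabilities for any context sum to 1, which is a useful sanity check before trusting a perplexity figure. With the question's data, `vocab_size` would presumably be `len(vocabulary) + 1` to account for the "*unknown*" token.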
