NLTK language model (ngram): computing the probability of a word from its context

Time: 2011-06-24 02:28:49

Tags: python nlp nltk

I am building a language model with Python and NLTK as follows:

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist, WittenBellProbDist
from nltk.model import NgramModel  # NgramModel lives in nltk.model in NLTK 2.x
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)
# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006
# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])

But it doesn't seem to work. The result is as follows:

>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
    "context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting

Can anyone help me? Thanks!

4 Answers:

Answer 0 (score: 12)

I know this question is old, but it pops up every time I google NLTK's NgramModel class. NgramModel's prob implementation is a bit unintuitive, the asker is confused, and as far as I can tell the existing answers aren't great. Since I don't use NgramModel often, this means I get confused too. No longer.

The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:

def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """

    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])

(Note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)'.)

OK. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:

for sent in train:
    for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)

OK. The constructor creates a conditional probability distribution (self._model) out of a conditional frequency distribution whose 'context' is a tuple of unigrams. This tells us that 'context' should NOT be a string or a list containing a single multi-word string. 'context' must be something iterable containing unigrams. In fact, the requirement is a little stricter: these tuples or lists must be of size n-1. Think of it this way: you told it this is a trigram model, so you had better give it the appropriate context for trigrams.

Let's see this with a simpler example:

>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0

(As an aside, actually trying to do anything with MLE as the estimator in NgramModel is a bad idea. Things will break. I guarantee it.)

As for the original question, my best guess at what the OP wanted is:

print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())

...but there are so many misunderstandings going on here that I can't tell what he was actually trying to do.
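To make the n-1 requirement concrete, here is a minimal sketch (my own, using a hypothetical helper called last_context that is not part of NLTK) which slices a context string down to its last n-1 tokens before calling prob, assuming the trigram model lm built in the question:

# hypothetical helper: keep only the last n-1 tokens of a context string,
# since an n-gram model only conditions on the previous n-1 words
def last_context(text, n=3):
    tokens = text.split()
    return tokens[-(n - 1):]

# for the trigram model above, this is equivalent to lm.prob("word", "generates a".split())
print lm.prob("word", last_context("This is a context which generates a"))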

Answer 1 (score: 7)

Quick fix:

print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006

Answer 2 (score: 7)

Regarding your second problem: this happens because "b" does not occur in the Brown corpus category news, as you can verify with:

>>> 'b' in brown.words(categories='news')
False

whereas

>>> 'word' in brown.words(categories='news')
True

I admit the error message is rather cryptic, so you may want to file a bug report with the NLTK authors.
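As a side note, a small guard one might add before querying the model (just a sketch, assuming the same Brown news training data as above) is to build the vocabulary as a set once and check membership against it, which also avoids rescanning the corpus on every lookup:

vocab = set(brown.words(categories='news'))  # one-time pass over the training words
for w in ("word", "b"):
    if w in vocab:
        print lm.prob(w, "generates a".split())
    else:
        print "%s is not in the training vocabulary" % w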

Answer 3 (score: 4)

For the time being I would stay away from NLTK's NgramModel. There is currently a smoothing bug that causes the model to hugely overestimate likelihoods when n > 1. If you do end up using NgramModel, you should definitely apply the fix mentioned in the git issue tracker: https://github.com/nltk/nltk/issues/367
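For illustration only, here is a rough sketch (my own, not part of the answer) of the kind of hand-rolled alternative this implies: build a trigram ConditionalProbDist directly, much like the constructor excerpt above does, and query it with an explicit two-word context. The Lidstone gamma of 0.2 and the ('The', 'Fulton') context are just example values:

import nltk
from nltk.corpus import brown

words = brown.words(categories='news')
# condition each word on the two words before it, mirroring the trigram constructor above
cfd = nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(words))
cpd = nltk.ConditionalProbDist(cfd, nltk.LidstoneProbDist, 0.2)

# 'The Fulton County ...' opens the news category, so this context is known to occur
print cpd[('The', 'Fulton')].prob('County')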