Question

English is not my native language, so I apologize for any grammatical errors.

I have seen many documents describe add-one smoothing for language models, but I am still very confused about the variable V in the formula:

P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)

For this example corpus, I am using bigrams:

<s> John read Moby Dick </s>
<s> Mary read a different book </s>
<s> She read a book by Cher </s>

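If I want to compute any P(w_i | w_{i-1}), V will be 11, because the number of distinct words w that can follow [w_{i-1}, w] is 11. But I noticed that this does not include the case [w_{i-1}, "</s>"] (otherwise V would be 12). Why don't we need to include this case? Is it because w_{i-1} would then be at the end of the article or sentence?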

Answer 1

There is a nice tutorial here: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

Consider an n-gram language model (without smoothing):

p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})

p(w_1, w_2, ..., w_n) = prod_{i=1}^{n} p(w_i | w_{i-1})

In code:

from collections import Counter
from functools import reduce
from operator import mul

from nltk import ngrams

def prob_product(prob_list):
    # Multiply all probabilities in the list together.
    return reduce(mul, prob_list, 1)

text = [['<s>', 'John', 'read', 'Moby', 'Dick', '</s>'],
        ['<s>', 'Mary', 'read', 'a', 'different', 'book', '</s>'],
        ['<s>', 'She', 'read', 'a', 'book', 'by', 'Cher', '</s>']]

# Count per sentence and sum, so no n-gram spans a sentence boundary.
bigram_counts = sum([Counter(ngrams(t, 2)) for t in text], Counter())
unigram_counts = sum([Counter(ngrams(t, 1)) for t in text], Counter())

count_S_John = bigram_counts[('<s>', 'John')]
count_S = unigram_counts[('<s>',)]

sentence = '<s> John read a book </s>'.split()
# p(w_i | w_i-1) = c(w_i-1 w_i) / c(w_i-1); bg[:-1] is the 1-tuple history.
prob_S_John_read_a_book = prob_product([bigram_counts[bg]/unigram_counts[bg[:-1]]
                                        for bg in ngrams(sentence, 2)])

print(prob_S_John_read_a_book) # 0.0555555

# Show the bigram count and the history count behind each factor.
for bg in ngrams(sentence, 2):
    print(bg, bigram_counts[bg], unigram_counts[bg[:-1]])

[out]:

0.0555555
('<s>', 'John') 1 3
('John', 'read') 1 1
('read', 'a') 2 3
('a', 'book') 1 2
('book', '</s>') 1 2
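
As a sanity check on the numbers above, the same product can be reproduced with exact fractions. This is only a sketch: it uses plain `zip` in place of `nltk.ngrams` so that it runs with the standard library alone, and the names `corpus`, `unigrams`, `bigrams` and `mle_prob` are illustrative, not part of the code above.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> John read Moby Dick </s>".split(),
    "<s> Mary read a different book </s>".split(),
    "<s> She read a book by Cher </s>".split(),
]

# Count tokens and adjacent-token pairs sentence by sentence.
unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

sentence = "<s> John read a book </s>".split()

# Unsmoothed estimate: p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
mle_prob = Fraction(1)
for prev, cur in zip(sentence, sentence[1:]):
    mle_prob *= Fraction(bigrams[(prev, cur)], unigrams[prev])

print(mle_prob)         # 1/18
print(float(mle_prob))  # roughly 0.0556
```

The factors are 1/3, 1/1, 2/3, 1/2 and 1/2, which multiply to 1/18.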

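With add-one smoothing, also known as Laplace smoothing,

p(w_i | w_{i-1}) = (1 + c(w_{i-1} w_i)) / (|V| + c(w_{i-1}))

where |V| is the number of distinct tokens, i.e. the vocabulary size (usually without <s> and </s>).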
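
To make |V| concrete for the example corpus from the question, here is a small sketch that counts distinct tokens with and without the sentence markers. The names `boundary_markers`, `all_types` and `vocab` are made up for this illustration.

```python
corpus = [
    "<s> John read Moby Dick </s>".split(),
    "<s> Mary read a different book </s>".split(),
    "<s> She read a book by Cher </s>".split(),
]

boundary_markers = {"<s>", "</s>"}

# Every distinct token in the corpus, markers included.
all_types = {tok for sent in corpus for tok in sent}

# The vocabulary as used in the formula above: markers excluded.
vocab = all_types - boundary_markers

print(len(all_types))  # 13
print(len(vocab))      # 11
print(sorted(vocab))
```

This matches the V = 11 mentioned in the question.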

So in code:

laplace_prob_S_John_read_a_book = prob_product([(1+bigram_counts[bg]) / (len(unigram_counts)-2 + unigram_counts[bg[:-1]])
                                                for bg in ngrams(sentence, 2)])

print(laplace_prob_S_John_read_a_book)

for bg in ngrams(sentence, 2):
    print(bg, 1+bigram_counts[bg], len(unigram_counts)-2 + unigram_counts[bg[:-1]])

[out]:

0.00012075836251660427
('<s>', 'John') 2 14
('John', 'read') 2 12
('read', 'a') 3 14
('a', 'book') 2 13
('book', '</s>') 2 13

Note: len(unigram_counts)-2 accounts for removing <s> and </s> from the number of words in the vocabulary.
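
The smoothed result can also be verified with exact fractions. Again this is a standard-library sketch (`zip` instead of `nltk.ngrams`), and `vocab_size` and `laplace_prob` are names chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> John read Moby Dick </s>".split(),
    "<s> Mary read a different book </s>".split(),
    "<s> She read a book by Cher </s>".split(),
]

unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

# Vocabulary size without the two boundary markers <s> and </s>.
vocab_size = len(unigrams) - 2

sentence = "<s> John read a book </s>".split()

# Add-one estimate: (1 + c(w_{i-1} w_i)) / (|V| + c(w_{i-1}))
laplace_prob = Fraction(1)
for prev, cur in zip(sentence, sentence[1:]):
    laplace_prob *= Fraction(1 + bigrams[(prev, cur)], vocab_size + unigrams[prev])

print(vocab_size)    # 11
print(laplace_prob)  # 1/8281
```

The factors are 2/14, 2/12, 3/14, 2/13 and 2/13, and 1/8281 is about 0.00012076, which agrees with the output above.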

That covers the how.

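Q: Why doesn't |V| take <s> and </s> into account?

A: One possible reason is that we never consider the empty sentence in language models, so <s> and </s> cannot stand by themselves, and the vocabulary |V| excludes them.

Q: Is it okay to add them to |V|?

A: Actually, if |V| is sufficiently large, adding +2 for <s> and </s> makes little difference. As long as |V| is fixed and consistent across all computations and is sufficiently large, the language model probability of any sentence relative to another sentence under the same language model should not change much.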
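
To see what the +2 does in practice, the sketch below scores two sentences under both choices of |V|. It is an illustration under stated assumptions: `laplace_sentence_prob` is a hypothetical helper written for this comparison, the second sentence is made up, and `math.prod` needs Python 3.8 or later.

```python
from collections import Counter
from math import prod

corpus = [
    "<s> John read Moby Dick </s>".split(),
    "<s> Mary read a different book </s>".split(),
    "<s> She read a book by Cher </s>".split(),
]

unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def laplace_sentence_prob(sentence, vocab_size):
    """Add-one smoothed bigram probability of a tokenized sentence."""
    return prod(
        (1 + bigrams[(prev, cur)]) / (vocab_size + unigrams[prev])
        for prev, cur in zip(sentence, sentence[1:])
    )

sent_a = "<s> John read a book </s>".split()  # starts with a seen bigram
sent_b = "<s> Cher read a book </s>".split()  # starts with unseen bigrams

word_types = len(unigrams) - 2  # 11, excluding <s> and </s>

for vocab_size in (word_types, word_types + 2):
    p_a = laplace_sentence_prob(sent_a, vocab_size)
    p_b = laplace_sentence_prob(sent_b, vocab_size)
    # Print both probabilities and whether sent_a still outranks sent_b.
    print(vocab_size, p_a, p_b, p_a > p_b)
```

On a corpus this small the absolute probabilities change noticeably (about 1/8281 with |V| = 11 versus 1/16800 with |V| = 13), but the ordering of the two sentences stays the same. With a realistic vocabulary of many thousands of words, the +2 becomes negligible.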

