Discount value in Stupid Backoff

Time: 2016-03-05 02:34:31

Tags: python nlp smoothing

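I am following the NLP tutorial here (6' 58''), the part about the Stupid Backoff smoothing algorithm. Both the tutorial video and the implementation of bi-gram level stupid-backoff use a discount value of 0.4.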

Tutorial Slide

Bigram-level backoff implementation:

import math

def score(self, sentence):
    """Return the log score of a sentence under bigram Stupid Backoff."""
    score = 0.0
    previous = sentence[0]
    for token in sentence[1:]:
        bicount = self.bigramCounts[(previous, token)]
        bi_unicount = self.unigramCounts[previous]
        unicount = self.unigramCounts[token]
        if bicount > 0:
            # Seen bigram: count(previous, token) / count(previous)
            score += math.log(bicount)
            score -= math.log(bi_unicount)
        else:
            # Unseen bigram: back off to the add-one smoothed unigram
            score += math.log(0.4)     # discount here
            score += math.log(unicount + 1)
            score -= math.log(self.total + self.vocab_size)
        previous = token
    return score

But in the trigram-level implementation, the discount value is 1.

import math

def score(self, sentence):
    """Return the log score of a sentence under trigram backoff (no discount)."""
    score = 0.0
    fst = sentence[0]
    snd = sentence[1]
    for token in sentence[2:]:
        tricount = self.trigramCounts[(fst, snd, token)]
        tri_bicount = self.bigramCounts[(fst, snd)]
        bicount = self.bigramCounts[(snd, token)]
        bi_unicount = self.unigramCounts[snd]
        unicount = self.unigramCounts[token]
        if tricount > 0:
            # Seen trigram: count(fst, snd, token) / count(fst, snd)
            score += math.log(tricount)
            score -= math.log(tri_bicount)
        elif bicount > 0:
            # First backoff level: bigram relative frequency
            score += math.log(bicount)             # no discount here
            score -= math.log(bi_unicount)
        else:
            # Second backoff level: add-one smoothed unigram
            score += math.log((unicount + 1))      # no discount here
            score -= math.log(self.total + self.vocab_size)
        fst, snd = snd, token
    return score

When I run the project with the tri-gram level discount set to 0.4 and to 1, I get these scores:

tri-gram with discount = 0.4 < bi-gram with discount = 0.4 < tri-gram with discount = 1

It is easy to see why: with discount = 0.4, the final else branch of the trigram version becomes:

else:
    score += math.log(0.4)      # natural log: about -0.9163 (log10 would give -0.3979)
    score += math.log(0.4)      # about -0.9163 again, one per backoff level
    score += math.log((unicount + 1))
    score -= math.log(self.total + self.vocab_size)
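
A quick check of the numbers involved, as a minimal sketch: Python's `math.log` is the natural logarithm, so each backoff step with a factor of 0.4 costs about -0.916, while -0.3979 is the base-10 value. `ALPHA` and the other names here are illustrative only and do not come from the project.

```python
import math

ALPHA = 0.4  # backoff factor used in the tutorial

# Penalty added to the log score for a single backoff step.
per_level = math.log(ALPHA)

# A trigram that falls back to the unigram level pays the penalty twice,
# which is the same as multiplying the score by ALPHA ** 2 = 0.16.
two_levels = 2 * math.log(ALPHA)

print(round(per_level, 4))          # natural-log penalty for one step
print(round(two_levels, 4))         # natural-log penalty for two steps
print(round(math.log10(ALPHA), 4))  # base-10 value, for comparison
```

Because the penalty is negative, any sentence that triggers at least one backoff scores lower with a factor of 0.4 than with a factor of 1, which matches the trigram ordering reported above.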

So I am really confused: where does the 0.4 value come from?

1 Answer:

Answer 0: (score: 0)

Take a look at the Google paper that proposed Stupid Backoff.
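
For context, the paper that introduced Stupid Backoff is, as far as I know, Brants et al. (2007), "Large Language Models in Machine Translation". It uses a single backoff factor and reports setting it heuristically to 0.4, so the value is an empirical choice, not a derived one. Below is a minimal sketch of that recursive scoring scheme, assuming all n-gram counts live in one `Counter` keyed by token tuples. `build_counts` and `stupid_backoff` are hypothetical names, not the project's API, and unlike the code in the question the unigram level here is a plain relative frequency without add-one smoothing.

```python
from collections import Counter

def build_counts(tokens, max_order=3):
    """Count every n-gram of order 1..max_order, keyed by token tuple."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(counts, total, ngram, alpha=0.4):
    """Return the Stupid Backoff score of the last word given the words before it.

    The result is a relative-frequency score, not a normalized probability.
    """
    ngram = tuple(ngram)
    if len(ngram) == 1:
        # Base case: unigram relative frequency.
        return counts[ngram] / total
    if counts[ngram] > 0:
        # Seen n-gram: its count divided by the count of its context.
        return counts[ngram] / counts[ngram[:-1]]
    # Unseen n-gram: drop the earliest word and multiply by alpha once.
    return alpha * stupid_backoff(counts, total, ngram[1:], alpha)

# Toy corpus, purely for illustration.
tokens = "the cat sat on the mat".split()
counts = build_counts(tokens)
total = len(tokens)

seen = stupid_backoff(counts, total, ("the", "cat", "sat"))       # no backoff
one_step = stupid_backoff(counts, total, ("on", "the", "cat"))    # backs off once
two_steps = stupid_backoff(counts, total, ("mat", "sat", "the"))  # backs off twice
print(seen, one_step, two_steps)
```

Each backoff step multiplies the score by `alpha`, so in log space a trigram that falls all the way back to the unigram level picks up `2 * log(alpha)`. That is the doubled penalty shown in the question's modified else branch.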