Question

I'm having trouble computing tf-idf values by hand. Python's scikit keeps spitting out values different from the ones I expect.

I keep reading that

idf(term) =  log(# of docs/ # of docs with term)

If so, wouldn't you get a division-by-zero error when no document contains the term?

To get around that, I've read that you use

log(# of docs / (# of docs with term + 1))

But if the term appears in every document, you get log(n / (n + 1)), which is negative, and that doesn't make sense to me.

What am I missing?

Answer 1

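The trick you describe is actually called Laplace smoothing (also known as additive or add-one smoothing), and it assumes that the same addend is added to the other part of the fraction as well: the numerator in your case, or the denominator in the original setting.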

In other words, you should also add 1 to the total number of documents:

log((# of docs + 1) / (# of docs with term + 1))
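
As a rough sketch, the snippet below compares the unsmoothed formula, the denominator-only variant from the question, and the add-one variant on a hypothetical corpus of 4 documents. The function names and the corpus size are made up for illustration, and natural log is assumed; this is not scikit-learn's implementation.

```python
import math

def idf_plain(n_docs, df):
    # log(n / df); raises ZeroDivisionError when df == 0
    return math.log(n_docs / df)

def idf_denominator_only(n_docs, df):
    # log(n / (df + 1)); negative when the term is in every document
    return math.log(n_docs / (df + 1))

def idf_add_one(n_docs, df):
    # log((n + 1) / (df + 1)); defined for df == 0, never negative for df <= n
    return math.log((n_docs + 1) / (df + 1))

n_docs = 4  # hypothetical corpus size
for df in (0, 1, 4):
    plain = idf_plain(n_docs, df) if df > 0 else float("nan")
    print(df, plain, idf_denominator_only(n_docs, df), idf_add_one(n_docs, df))
```

With df = 4 (the term is in every document) the denominator-only variant is negative, while the add-one variant is exactly 0.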

By the way, it is often better to use a smaller addend, especially with a small corpus:

log((# of docs + a) / (# of docs with term + a)),

where a = 0.001 or something like that.
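
To see why a smaller addend can matter on a small corpus, here is a minimal sketch with the addend as a parameter. The name `idf_smoothed`, the 4-document corpus, and the choice of natural log are assumptions for illustration only.

```python
import math

def idf_smoothed(n_docs, df, a=0.001):
    # log((n + a) / (df + a)); `a` is the smoothing addend
    return math.log((n_docs + a) / (df + a))

n_docs, df = 4, 1                 # hypothetical: term appears in 1 of 4 documents
unsmoothed = math.log(n_docs / df)

# How far each addend moves the value away from the unsmoothed idf
shift_add_one = abs(idf_smoothed(n_docs, df, a=1.0) - unsmoothed)
shift_small = abs(idf_smoothed(n_docs, df, a=0.001) - unsmoothed)
print(shift_add_one, shift_small)

# A term that appears in no document still gets a finite value
print(idf_smoothed(n_docs, 0))
```

In this sketch the small addend leaves the idf of terms that do occur almost unchanged, while add-one shifts it noticeably; the trade-off is that a term with df = 0 gets a very large, though finite, value.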