Is smooth_idf redundant?

Date: 2017-11-02 07:24:55

Tags: python machine-learning scikit-learn tf-idf

From the scikit-learn documentation:

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.
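The smoothed formula quoted above can be sketched directly; this is a minimal illustration of the arithmetic, not scikit-learn's internal implementation:

```python
import math

# Smoothed idf as quoted from the docs:
# idf(d, t) = log((1 + n) / (1 + df(d, t))) + 1
def smooth_idf(n_docs, df):
    return math.log((1 + n_docs) / (1 + df)) + 1

# Even when df(d, t) = 0, the value stays finite:
print(smooth_idf(2, 0))  # log(3/1) + 1, roughly 2.0986
print(smooth_idf(2, 2))  # log(3/3) + 1 = 1.0
```

The "+1" in numerator and denominator guarantees the fraction is always finite and positive, which is exactly what prevents the zero division the question asks about.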

But why would df(d, t) = 0 ever occur? If a term does not appear in any document, it would not be in the vocabulary in the first place, would it?

1 Answer:

Answer 0: (score: 2)

This option is useful in TfidfVectorizer. According to the documentation, this class can accept a predefined vocabulary. If a word from the vocabulary was never seen in the training data but does occur in the test data, smooth_idf allows it to be handled successfully.

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['apple mango', 'mango banana']
test_texts = ['apple banana', 'mango orange']
vocab = ['apple', 'mango', 'banana', 'orange']  # 'orange' never occurs in train_texts

vectorizer1 = TfidfVectorizer(smooth_idf=True, vocabulary=vocab).fit(train_texts)
vectorizer2 = TfidfVectorizer(smooth_idf=False, vocabulary=vocab).fit(train_texts)
print(vectorizer1.transform(test_texts).todense())  # works okay
print(vectorizer2.transform(test_texts).todense())  # raises a ValueError

Output:

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.43016528  0.          0.90275015]]
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
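The failure can be traced back to the unsmoothed formula. Without smoothing, idf(t) = log(n / df(t)) + 1, which blows up for a vocabulary term with df = 0; the infinite weight then turns into NaN during normalization, triggering the ValueError above. A minimal sketch of the divergence (plain arithmetic, not scikit-learn internals):

```python
import math

n_docs = 2      # number of training documents
df_orange = 0   # 'orange' is in the vocabulary but absent from train_texts

# Unsmoothed idf: log(n / df) + 1 is undefined when df = 0.
try:
    idf = math.log(n_docs / df_orange) + 1
except ZeroDivisionError:
    idf = float('inf')

print(idf)  # inf
```

scikit-learn computes this with NumPy floats, so instead of an exception the unseen term simply gets an infinite idf weight, and the NaN surfaces later when the tf-idf row is normalized.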