Question

I am using scikit-learn to compute tf-idf values.

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

   Docs      blue       bright     sky        sun
   D1        tf-idf     0.0000000  tf-idf     0.0000000
   D2        0.0000000  tf-idf     0.0000000  tf-idf
   D3        0.0000000  tf-idf     tf-idf     tf-idf

So my code in Python is:

import nltk
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

transformer = TfidfVectorizer(stop_words=stop_words)

t1 = transformer.fit_transform(train_set).todense()
print(t1)

The resulting matrix I get is:

[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]

If I do the calculation by hand, the matrix should be:

            Docs  blue      bright       sky       sun
            D1    0.2385    0.0000000  0.0880    0.0000000
            D2    0.0000000 0.0880     0.0000000 0.0880
            D3    0.0000000 0.058      0.058     0.058

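Here is how I calculate it, taking blue as an example: tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255, so tf-idf = tf * idf = 0.5 * 0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering why the hand-calculated matrix and the Python matrix give different results. Which one is correct? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?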
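The hand calculation described above can be sketched in plain Python. This is only an illustration of that formula (tf as a within-document fraction, idf as a base-10 log of N/df), not what scikit-learn does. The `STOP_WORDS` set and the `hand_tfidf` helper are assumptions made for this sketch; the set is a small stand-in for the NLTK stop-word list.

```python
import math

# Stand-in for NLTK's English stop-word list (assumption): only the
# stop words that actually occur in these three documents are listed.
STOP_WORDS = {"is", "in", "the"}

docs = ["sky is blue", "sun is bright", "sun in the sky is bright"]
tokenized = [[w for w in d.split() if w not in STOP_WORDS] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

def hand_tfidf(term, doc):
    """tf-idf as in the hand calculation: relative tf times log10(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in tokenized)
    return tf * math.log10(n_docs / df)

hand_matrix = [[round(hand_tfidf(t, doc), 4) for t in vocab] for doc in tokenized]

print(vocab)
for row in hand_matrix:
    print(row)
```

The rows agree with the hand-computed table up to rounding in the last digit.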

Answer 1

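There are two reasons:

You are neglecting smoothing, which is commonly applied in this kind of computation.
You are assuming a base-10 logarithm.

According to the source, sklearn makes neither of those assumptions.

First, it smooths the document counts (so a count is never 0):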

df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

It also uses the natural logarithm (np.log(np.e) == 1):

idf = np.log(float(n_samples) / df) + 1.0

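The default l2 normalization is applied as well. In short, scikit-learn does many more "nice little things" while computing tf-idf. Neither approach (theirs or yours) is bad; theirs is simply more advanced.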
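To see how these three steps combine, here is a minimal pure-Python sketch that applies smoothing, the natural logarithm with the added 1, and l2 normalization to the three documents from the question. It assumes the default settings (raw term counts as tf, `smooth_idf=True`, l2 norm) and that stop-word removal leaves the tokens listed below. The helper names `smoothed_idf` and `tfidf_row` are made up for this example and are not part of scikit-learn.

```python
import math

# Documents after stop-word removal (assumed tokenization).
docs = [["sky", "blue"], ["sun", "bright"], ["sun", "sky", "bright"]]
vocab = sorted({w for d in docs for w in d})  # blue, bright, sky, sun
n_samples = len(docs)

def smoothed_idf(term):
    """idf = ln((n_samples + 1) / (df + 1)) + 1, as in the quoted source."""
    df = sum(term in d for d in docs)
    return math.log((n_samples + 1) / (df + 1)) + 1.0

def tfidf_row(doc):
    """Raw count times smoothed idf, then scaled to unit Euclidean length."""
    raw = [doc.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 8) for v in row])
```

Under these assumptions the first and third rows match the matrix printed in the question.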

Answer 2

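smooth_idf : boolean, default=True

This selects the smoothed version of idf. There are many variants; in Python (scikit-learn) the following one is used: $1 + \log((N + 1) / (n + 1))$, where $N$ is the total number of documents and $n$ is the number of documents containing the term.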

tf : 1/2, 1/2
idf with smoothing: (log(4/2)+1) ,(log(4/3)+1)
tf-idf : 1/2* (log(4/2)+1) ,1/2 * (log(4/3)+1)
L-2 normalization: 0.79596054 0.60534851
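The arithmetic above for D1 can be checked with a few lines of Python. The sketch also shows that the choice of tf = 1/2 versus a raw count of 1 makes no difference once the row is l2-normalized, because a common factor cancels out. The `l2_normalize` helper is a name introduced here for illustration.

```python
import math

def l2_normalize(values):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

# Smoothed idf for D1's two terms: blue (in 1 of 3 docs), sky (in 2 of 3 docs).
idf_blue = math.log(4 / 2) + 1
idf_sky = math.log(4 / 3) + 1

with_half_tf = l2_normalize([0.5 * idf_blue, 0.5 * idf_sky])  # tf = 1/2
with_raw_count = l2_normalize([1 * idf_blue, 1 * idf_sky])    # tf = raw count 1

print([round(v, 8) for v in with_half_tf])
print([round(v, 8) for v in with_raw_count])
```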

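By the way, the second row of the matrix in the original question is probably wrong: its two nonzero values should be the same, as in my own output from Python.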
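A quick check of that claim, as a sketch: in D2 both "bright" and "sun" occur in two of the three documents, so they share the same smoothed idf and should get equal weights after l2 normalization. The second case below is purely hypothetical: it shows that a 1:2 count ratio (for example, "sun" counted twice) would give the values printed in the question. Whether that is what actually happened there is only a guess.

```python
import math

def l2_normalize(values):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

n_docs = 3
# "bright" and "sun" each occur in two of the three documents.
idf = math.log((n_docs + 1) / (2 + 1)) + 1.0

equal_counts = l2_normalize([1 * idf, 1 * idf])  # bright once, sun once
double_sun = l2_normalize([1 * idf, 2 * idf])    # hypothetical: sun counted twice

print([round(v, 8) for v in equal_counts])
print([round(v, 8) for v in double_sun])
```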

Difference in tf-idf matrix values computed with scikit-learn and by hand
