Question

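I am trying to implement a tf-idf vectorizer from scratch in Python. I computed the IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer().

What am I doing wrong?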

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',
]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

sentence = []
for i in range(len(corpus)):
    sentence.append(corpus[i].split())

word_freq = {}   # maps each word to the set of document indices that contain it
for i in range(len(sentence)):
    tokens = sentence[i]
    for w in tokens:
        try:
            word_freq[w].add(i)  # word already seen: record this document index
        except KeyError:
            word_freq[w] = {i}  # first occurrence of the word: start a new set

for i in word_freq:
    word_freq[i] = len(word_freq[i])  # number of documents that contain the word, i.e. its document frequency

def idf():
    idfDict = {}
    for word in word_freq:
        idfDict[word] = math.log(len(sentence) / word_freq[word])
    return idfDict
idfDict = idf()

Expected output (obtained from vectorizer.idf_):

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.22314355 1.91629073 1.        ]

Actual output (the idf value for each key):

{'and': 1.3862943611198906,
'document': 0.28768207245178085,
'first': 0.6931471805599453,
'is': 0.0,
'one': 1.3862943611198906,
'second': 1.3862943611198906,
'the': 0.0,
'third': 1.3862943611198906,
'this': 0.0
 }

Answer 1

A few default parameters can affect sklearn's calculation, but the one that seems to matter here is:

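smooth_idf : boolean (default=True) Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.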

If you subtract one from each element and raise e to that power, you get values very close to 5 / n for small values of n:

1.91629073 => 5/2
1.22314355 => 5/4
1.51082562 => 5/3
1 => 5/5
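That observation can be checked numerically. The following is a minimal sketch that uses only the expected values quoted in the question; the variable names are illustrative. The 5 in the numerator is consistent with four documents plus the one extra document that smoothing adds.

```python
import math

# Distinct values copied from the expected vectorizer.idf_ output in the question.
expected_idf = [1.91629073, 1.22314355, 1.51082562, 1.0]

# Subtract 1 and exponentiate to undo the "+ 1" and the natural logarithm.
ratios = [math.exp(v - 1) for v in expected_idf]

for value, ratio in zip(expected_idf, ratios):
    # 5 / ratio recovers the denominator n for each value.
    print(f"{value:.8f} -> exp(v - 1) = {ratio:.4f}, n = {5 / ratio:.2f}")
```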

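In any case, there is no single tf-idf implementation. The metric you define is just a heuristic that tries to capture certain properties (for example, "a higher idf should correlate with rarity in the corpus"), so I would not worry too much about reproducing an identical implementation.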

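sklearn appears to use: log((document_length + 1) / (frequency of word + 1)) + 1, where document_length is the number of documents in the corpus and the frequency of a word is the number of documents that contain it. This is as if one extra document contained every word in the corpus.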
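As a rough illustration, the smoothed formula can be applied to the corpus from the question using only the standard library. This is a sketch under the assumption that the formula above is the one in effect; `smoothed_idf` is a hypothetical helper name, not part of sklearn, and tokenization here is a plain whitespace split.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(docs):
    """Return {term: idf} using log((N + 1) / (df + 1)) + 1 (assumed formula)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() ensures a term repeated inside one document is counted once.
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Sort terms alphabetically so the order is deterministic.
    return {
        term: math.log((n_docs + 1) / (df + 1)) + 1
        for term, df in sorted(doc_freq.items())
    }

idf_values = smoothed_idf(corpus)
for term, value in idf_values.items():
    print(f"{term}: {value:.8f}")
```

On this corpus, 'and' (found in one document) comes out at log(5/2) + 1, about 1.9163, and 'is' (found in all four) comes out at exactly 1.0, in line with the corresponding entries of the expected output.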

Edit: the docstring of TfIdfNormalizer confirms the smoothed formula given above.
