Question

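I am computing tf-idf values by hand in code and then checking them against sklearn's TfidfVectorizer. Strangely, the values do not match, and I can't figure out why. Below are the code I wrote and the sample data. Can someone explain what I am doing wrong? I treat each string as a document and the words in it as tokens.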

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

import math

def CalcTfIdf(data):
    tfIdf1 = []
    count = 0
    for s1 in data:
        count = count + 1
        for tok in s1.split():
            df1 = 0

            # term frequency: occurrences of tok in this document
            tf = s1.split().count(tok)

            # document frequency: number of documents containing tok
            for l in data:
                df1 = df1 + list(set(l.split())).count(tok)

            # unnormalized tf-idf: base-10 log with +1 in the denominator
            idf = math.log10(len(data)/(1 + df1))
            tfIdf = tf * idf
            tfIdf1.append((count, tok, tfIdf))
    return tfIdf1

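from sklearn.feature_extraction.text import TfidfVectorizer

def CalcTfIdfLibrary(data):
    # this token_pattern also keeps single-character tokens such as 'm'
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(data)
    print(X)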

For the 7th document, which contains only the single word "tyson", sklearn reports a tf-idf value of 1, whereas my calculation gives about 0.43. How can it be 1?
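One possible explanation, sketched under the assumption that TfidfVectorizer is running with its documented defaults (`norm='l2'`, `smooth_idf=True`): each document's row is scaled to unit Euclidean length, so a document with a single distinct token ends up with weight 1.0 whatever its idf is. The helper name `l2_normalize` is hypothetical and not part of sklearn.

```python
import math

def l2_normalize(weights):
    """Scale a list of weights to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

n_docs, df_tyson = 8, 2  # 'tyson' occurs in 2 of the 8 documents

# Formula used in the question: log10(N / (1 + df)), no normalization
manual = math.log10(n_docs / (1 + df_tyson))

# Smoothed idf, assumed from sklearn's documented default: ln((1 + N) / (1 + df)) + 1
smoothed = math.log((1 + n_docs) / (1 + df_tyson)) + 1

print(manual)                    # about 0.426, the hand-computed value
print(l2_normalize([smoothed]))  # a one-term row normalizes to 1.0
```

Under this reading, the 0.43 and the 1.0 are not two estimates of the same quantity: the first is a raw tf * idf product, the second is a component of a unit-length vector.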

Result 1 (manual):
[(1, 'aron', 0.6020599913279624), (1, 'carey', 0.4259687322722811),
 (2, 'renee', 0.4259687322722811), (2, 'marie', 0.6020599913279624), (2, 'parilak', 0.6020599913279624), (2, 'teate', 0.4259687322722811),
 (3, 'colt', 0.6020599913279624), (3, 'carey', 0.4259687322722811),
 (4, 'renee', 0.4259687322722811), (4, 'm', 0.6020599913279624), (4, 'teate', 0.4259687322722811),
 (5, 'ciara', 0.6020599913279624), (5, 'acton', 0.6020599913279624),
 (6, 'adrian', 0.6020599913279624), (6, 'adams', 0.6020599913279624),
 (7, 'tyson', 0.4259687322722811),
 (8, 'tyson', 0.4259687322722811), (8, 'mclure', 0.6020599913279624)]

Result 2 (sklearn; each entry is (document index, feature index) followed by the value):
(0, 3) 0.7664298449085388    (0, 4) 0.6423280258820045
(1, 11) 0.45419450284733365  (1, 8) 0.5419477406385818  (1, 10) 0.5419477406385818  (1, 12) 0.45419450284733365
(2, 4) 0.6423280258820045    (2, 6) 0.7664298449085388
(3, 11) 0.540442546068122    (3, 12) 0.540442546068122  (3, 7) 0.6448594488714666
(4, 5) 0.7071067811865475    (4, 0) 0.7071067811865475
(5, 2) 0.7071067811865475    (5, 1) 0.7071067811865475
(6, 13) 1.0
(7, 13) 0.6423280258820045   (7, 9) 0.7664298449085388
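The sklearn numbers appear consistent with a different formula from the manual one. The following is a pure-Python sketch, not sklearn's actual implementation, assuming the documented defaults: raw term counts, smoothed idf `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document row. The function name `tfidf_sklearn_style` is hypothetical, and splitting on whitespace is assumed to match the `token_pattern` for this all-lowercase sample data.

```python
import math
from collections import Counter

def tfidf_sklearn_style(docs):
    """Sketch of tf-idf with smoothed idf and L2 row normalization (assumed defaults)."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # raw count times smoothed idf (natural log, +1 added to the idf)
        w = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # scale the row to unit Euclidean length; guard against empty documents
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        rows.append({t: v / norm for t, v in w.items()})
    return rows

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

rows = tfidf_sklearn_style(data)
for i, row in enumerate(rows):
    print(i, row)
```

Sorting the vocabulary alphabetically gives feature index 3 for 'aron', 4 for 'carey', and 13 for 'tyson', which lets the printed rows be lined up against Result 2. The manual function differs in three places: base-10 log, the `1 + df` denominator without the matching `1 + N` numerator or the trailing `+ 1`, and no normalization.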

Why do the manually computed tf-idf values differ from the ones produced by sklearn's tf-idf implementation?

0 answers: