Why do manually calculated tf-idf values differ from those computed by the sklearn tf-idf library?

Time: 2019-05-13 13:30:18

Tags: python tf-idf tfidfvectorizer

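I am calculating tf-idf values by hand in code, and then tried to check them against sklearn's TfidfVectorizer. Strangely, the values do not match, and I cannot find the reason. Below are the code I wrote and the sample data. Can someone explain what I am doing wrong? I treat each string as a document and the words in it as tokens.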

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

import math

def CalcTfIdf(data):
    # Manual tf-idf: raw term count times log10(N / (1 + df)), no normalization
    tfIdf1 = []
    count = 0
    for s1 in data:
        count = count + 1
        tokens = s1.split()
        for tok in tokens:
            # calculate term frequency within this document
            tf = tokens.count(tok)

            # calculate document frequency: number of documents containing tok
            df1 = 0
            for l in data:
                df1 = df1 + (1 if tok in l.split() else 0)

            idf = math.log10(len(data) / (1 + df1))
            tfIdf = tf * idf
            tfIdf1.append((count, tok, tfIdf))
    return tfIdf1

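from sklearn.feature_extraction.text import TfidfVectorizer

def CalcTfIdfLibrary(data):
    # token_pattern keeps single-character tokens such as 'm'
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(data)
    print(X)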

Document 7, which contains only the single word 'tyson', gets a tf-idf value of 1 from the sklearn library, while my calculation gives 0.42. How can it be 1?

  • Result 1: [(1, 'aron', 0.6020599913279624), (1, 'carey', 0.4259687322722811), (2, 'renee', 0.4259687322722811), (2, 'marie', 0.6020599913279624), (2, 'parilak', 0.6020599913279624), (2, 'teate', 0.4259687322722811), (3, 'colt', 0.6020599913279624), (3, 'carey', 0.4259687322722811), (4, 'renee', 0.4259687322722811), (4, 'm', 0.6020599913279624), (4, 'teate', 0.4259687322722811), (5, 'ciara', 0.6020599913279624), (5, 'acton', 0.6020599913279624), (6, 'adrian', 0.6020599913279624), (6, 'adams', 0.6020599913279624), (7, 'tyson', 0.4259687322722811), (8, 'tyson', 0.4259687322722811), (8, 'mclure', 0.6020599913279624)]

  • Result 2: (0, 3) 0.7664298449085388 (0, 4) 0.6423280258820045 (1, 11) 0.45419450284733365 (1, 8) 0.5419477406385818 (1, 10) 0.5419477406385818 (1, 12) 0.45419450284733365 (2, 4) 0.6423280258820045 (2, 6) 0.7664298449085388 (3, 11) 0.540442546068122 (3, 12) 0.540442546068122 (3, 7) 0.6448594488714666 (4, 5) 0.7071067811865475 (4, 0) 0.7071067811865475 (5, 2) 0.7071067811865475 (5, 1) 0.7071067811865475 (6, 13) 1.0 (7, 13) 0.6423280258820045 (7, 9) 0.7664298449085388
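For comparison, here is a minimal sketch of a manual calculation that follows TfidfVectorizer's documented defaults instead of the log10 formula used in the question. With default settings, sklearn computes idf as ln((1 + N) / (1 + df)) + 1 (natural log, smoothed, plus one) and then L2-normalizes each document vector. The function name `sklearn_style_tfidf` is hypothetical, and tokenizing with `str.split()` is an assumption that happens to agree with the token pattern on this sample data.

```python
import math

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

def sklearn_style_tfidf(docs):
    """Hand-rolled tf-idf mimicking TfidfVectorizer defaults (illustrative only)."""
    # Assumption: whitespace splitting matches the vectorizer's tokens for this data.
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each token.
    df = {}
    for tokens in tokenized:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1

    rows = []
    for tokens in tokenized:
        # Raw count times smoothed idf: ln((1 + N) / (1 + df)) + 1.
        weights = {
            tok: tokens.count(tok) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok in set(tokens)
        }
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

rows = sklearn_style_tfidf(data)
print(rows[0])  # 'aron carey'
print(rows[6])  # 'tyson'
```

Under these assumptions the first document comes out at roughly 0.766 for 'aron' and 0.642 for 'carey', in line with Result 2. A document with a single token always normalizes to a weight of 1, whatever its idf, which would account for the 1.0 reported for 'tyson'.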

0 answers:

No answers