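Trying to replicate a TF-IDF example, multiplication returns the wrong numbers

Asked: 2018-05-09 09:36:00

Tags: python python-3.x tf-idf

I am trying to replicate the TF-IDF example from this video: Using TF-IDF to convert unstructured text to useful features

As far as I can tell, my code is identical to the code in the example, except that I use .items (Python 3) instead of .iteritems (Python 2):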

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")

wordSet= set(bowA).union(set(bowB))

wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

for word in bowA:
        wordDictA[word]+=1

for word in bowB:
        wordDictB[word]+=1

import pandas as pd

bag = pd.DataFrame([wordDictA, wordDictB])

print(bag)

def computeTF(wordDict,bow):
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
                tfDict[word] = count / float(bowCount)
        return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

def computeIDF(docList):
        import math
        idfDict = {}
        N = len(docList)
        #Count N of docs that contain word w
        idfDict = dict.fromkeys(docList[0].keys(),0)
        for doc in docList:
                for word, val in doc.items():
                        if val > 0:
                                idfDict[word] +=1
        for word, val in idfDict.items():
                idfDict[word] = math.log(N/ float(val))
        return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow,idfs):
        tfidf = {}
        for word, val in tfBow.items():
                tfidf[word] = val * idfs[word]
        return tfidf

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TF = pd.DataFrame([tfidfBowA, tfidfBowB])

print(TF)

The resulting table should look like this, with the common words (on, my, sat, the) all scoring 0:

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000   
1  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000  0.000000 

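But my resulting data frame looks like this, with every word getting the same score except for the ones that appear in only one of the documents (bed/dog, cat/face):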

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833  0.020833   
1  0.020833  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833 

If I print(idfs), I get:

{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}

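Here the words contained in both documents have the value 0, which is then used to weight their importance, since they are common to all documents. Before the computeTFIDF function is applied, the data looks like this: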

{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}

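Since the function multiplies these two numbers, "my" (idf of 0) should come out as 0, and "dog" (idf of 0.6931 according to the example) should come out as 0.6931 * 0.1666 = 0.11. Instead, I get 0.02083 for every word except those that are absent from the document. Is there something besides the items/iteritems syntax difference between Python 2 and 3 that could be messing up my code?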

1 Answer:

Answer 0: (score: 1)

In the second-to-last section, just before the conversion to a DataFrame, change these two lines:

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

to:

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

To compute TF-IDF you have to call the function computeTFIDF() rather than computeTF().
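As a rough check of where the 0.020833 in the question comes from: calling computeTF(tfBowA, idfs) treats the term frequencies as if they were counts and divides each one by len(idfs), which is 8 here. The sketch below redoes that arithmetic for a single word; the variable names are made up for illustration and are not part of the original code.

```python
import math

# Values taken from the question: a word that occurs once in a six-word
# document, and a vocabulary (the keys of idfs) of eight distinct words.
tf_cat = 1 / 6
vocabulary_size = 8

# What computeTF(tfBowA, idfs) does: divide the tf value by len(idfs).
wrong_score = tf_cat / vocabulary_size

# What computeTFIDF(tfBowA, idfs) does: multiply the tf value by the idf,
# which is log(2 / 1) for a word found in one of the two documents.
right_score = tf_cat * math.log(2 / 1)

print(round(wrong_score, 6))  # 0.020833
print(round(right_score, 6))  # 0.115525
```

Words shared by both documents get the same 0.020833 from the wrong call because their idf of 0 is never used; only their tf value is divided by 8.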

Output

tfidfBowA
{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

tfidfBowB
{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}
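For reference, here is a condensed sketch of the whole pipeline with the corrected call, using only the standard library. It is not the code from the video or the question: the function names (term_frequency, inverse_document_frequency, tf_idf) are hypothetical, and it assumes the same unsmoothed log(N / df) weighting as computeIDF above.

```python
import math

doc_a = "the cat sat on my face".split(" ")
doc_b = "the dog sat on my bed".split(" ")
vocabulary = set(doc_a) | set(doc_b)

def term_frequency(tokens):
    # Share of the document's tokens taken up by each vocabulary word.
    return {word: tokens.count(word) / len(tokens) for word in vocabulary}

def inverse_document_frequency(documents):
    # log(N / df), where df is the number of documents containing the word.
    # df is never 0 here because the vocabulary is built from the documents.
    n = len(documents)
    idf = {}
    for word in vocabulary:
        containing = sum(1 for tokens in documents if word in tokens)
        idf[word] = math.log(n / containing)
    return idf

def tf_idf(tf, idf):
    # The multiplication step: each tf value times the idf of the same word.
    return {word: value * idf[word] for word, value in tf.items()}

idfs = inverse_document_frequency([doc_a, doc_b])
tfidf_a = tf_idf(term_frequency(doc_a), idfs)
tfidf_b = tf_idf(term_frequency(doc_b), idfs)

print(sorted(tfidf_a.items()))
print(sorted(tfidf_b.items()))
```

With these two documents, the words unique to one document score about 0.1155 there, and every word shared by both scores 0, which agrees with the dictionaries above.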

Hope that helps!