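I am trying to reproduce the TF-IDF example from this video: Using TF-IDF to convert unstructured text to useful features
As far as I can tell, my code is the same as the example's, except that I use .items() (Python 3) instead of .iteritems() (Python 2):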
docA = "the cat sat on my face"
docB = "the dog sat on my bed"
bowA = docA.split(" ")
bowB = docB.split(" ")
wordSet= set(bowA).union(set(bowB))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
import pandas as pd
bag = pd.DataFrame([wordDictA, wordDictB])
print(bag)
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
def computeIDF(docList):
    import math
    idfDict = {}
    N = len(docList)
    # Count the number of docs that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict
idfs = computeIDF([wordDictA, wordDictB])
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)
TF = pd.DataFrame([tfidfBowA, tfidfBowB])
print(TF)
The resulting table should look like this, with the common words (on, my, sat, the) all scoring 0:
bed cat dog face my on sat the
0 0.000000 0.115525 0.000000 0.115525 0.000000 0.000000 0.000000 0.000000
1 0.115525 0.000000 0.115525 0.000000 0.000000 0.000000 0.000000 0.000000
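But my resulting DataFrame looks like this: every word has the same score, except the ones that appear in only one document (bed/dog, cat/face), which are 0 in the row for the document that lacks them: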
bed cat dog face my on sat the
0 0.000000 0.020833 0.000000 0.020833 0.020833 0.020833 0.020833 0.020833
1 0.020833 0.000000 0.020833 0.000000 0.020833 0.020833 0.020833 0.020833
If I print(idfs) I get:
{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}
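Here, the words that appear in both documents have the value 0, which is then used to weight their importance, since they are common to all documents. Before the computeTFIDF function is applied, the data looks like this: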
{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}
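Since the function multiplies the two numbers, "my" (idf of 0) should be 0, and "dog" (idf of 0.6931, as in the example) should be 0.6931 * 0.1666 ≈ 0.11. Instead, I get 0.02083 for every word except those that are not in the document. Is there anything other than the .items()/.iteritems() syntax difference between Python 2 and 3 that could be messing up my code?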
Answer 0 (score: 1)
In the second-to-last part, just before converting to a DataFrame, change these two lines:
tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)
to:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)
To compute the TF-IDF you have to call the function computeTFIDF() instead of computeTF().
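As a rough illustration of where 0.020833 comes from: computeTF(tfBowA, idfs) treats the TF values as counts and the idfs dictionary as the bag of words, so every TF value is divided by len(idfs), which is 8. The sketch below restates the two functions from the question in compact form; the hard-coded tfBowA and idfs values are assumptions reconstructed from the dictionaries printed in the question, not taken from the video.

```python
import math

def computeTF(wordDict, bow):
    # Same logic as computeTF in the question: divide every value by len(bow).
    return {word: count / float(len(bow)) for word, count in wordDict.items()}

def computeTFIDF(tfBow, idfs):
    # Same logic as computeTFIDF in the question: TF times IDF, word by word.
    return {word: val * idfs[word] for word, val in tfBow.items()}

words_a = ["the", "cat", "sat", "on", "my", "face"]  # words of docA
only_one_doc = ["cat", "face", "dog", "bed"]         # words found in exactly one doc
vocab = set(words_a) | set(only_one_doc)

# Assumed values, matching the dictionaries printed in the question.
tfBowA = {w: (1 / 6 if w in words_a else 0.0) for w in vocab}
idfs = {w: (math.log(2) if w in only_one_doc else 0.0) for w in vocab}

wrong = computeTF(tfBowA, idfs)     # divides each TF by len(idfs) == 8
right = computeTFIDF(tfBowA, idfs)  # multiplies each TF by its IDF

print(round(wrong["my"], 6), round(wrong["cat"], 6))  # 0.020833 0.020833
print(round(right["my"], 6), round(right["cat"], 6))  # 0.0 0.115525
```

The wrong call never looks at the IDF values at all, which is why the common words and the rare words end up with the same score.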
Output
tfidfBowA
{'bed': 0.0,
'cat': 0.11552453009332421,
'dog': 0.0,
'face': 0.11552453009332421,
'my': 0.0,
'on': 0.0,
'sat': 0.0,
'the': 0.0}
tfidfBowB
{'bed': 0.11552453009332421,
'cat': 0.0,
'dog': 0.11552453009332421,
'face': 0.0,
'my': 0.0,
'on': 0.0,
'sat': 0.0,
'the': 0.0}
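For cross-checking these numbers, here is a compact end-to-end sketch of the same computation. It uses collections.Counter instead of the hand-built dictionaries, and the names bows, vocab, tf, df, idf and tfidf are illustrative choices, not from the original code.

```python
import math
from collections import Counter

docs = ["the cat sat on my face", "the dog sat on my bed"]
bows = [doc.split() for doc in docs]
vocab = sorted(set().union(*bows))

def tf(bow):
    # Term frequency: occurrences of each vocabulary word / words in the doc.
    counts = Counter(bow)
    return {word: counts[word] / len(bow) for word in vocab}

# Document frequency: number of documents that contain each word.
df = {word: sum(word in bow for bow in bows) for word in vocab}

# Unsmoothed IDF with the natural log, as in computeIDF from the question.
idf = {word: math.log(len(bows) / df[word]) for word in vocab}

tfidf = [{word: value * idf[word] for word, value in tf(bow).items()}
         for bow in bows]

for row in tfidf:
    print({word: round(score, 6) for word, score in row.items()})
```

Words that occur in both documents get an IDF of log(2/2) = 0, so their TF-IDF is 0 regardless of how often they appear.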
Hope that helps!