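I have looked into several ways of computing TF-IDF scores for the words in a document with Python, and I chose TextBlob.
I am getting output, but the values are negative. I understand that this is not correct (a non-negative quantity (tf) divided by the log of a positive quantity (df) should not produce negative values).
I have looked at the following question posted here: TFIDF calculating confusion, but it did not help.
How I compute the scores: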
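import math

def tf(word, blob):
    # Term frequency: share of the blob's words that equal `word`.
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Number of blobs that contain `word`.
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # Inverse document frequency; the +1 guards against division by zero.
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)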
Then I simply print the words together with their scores.
"hello, this is a test. a test is always good."
Top words in document
Word: good, TF-IDF: -0.06931
Word: this, TF-IDF: -0.06931
Word: always, TF-IDF: -0.06931
Word: hello, TF-IDF: -0.06931
Word: a, TF-IDF: -0.13863
Word: is, TF-IDF: -0.13863
Word: test, TF-IDF: -0.13863
From what I know and what I have seen, could the IDF calculation be incorrect?
Any help would be greatly appreciated. Thanks.
Answer 0: (score: 1)
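Without an input/output example it is hard to pin down the cause. One likely suspect is the idf() method, which returns a negative value when word appears in every blob. That happens because of the +1 in the denominator, which I assume is there to avoid division by zero.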
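To make the arithmetic concrete, here is a minimal sketch that reproduces the scores from the question without TextBlob. It assumes the corpus holds only the one sample sentence and uses a simple regex tokenizer as a stand-in for blob.words; the names `words`, `n_docs`, and `df` are illustrative only.

```python
import math
import re

doc = "hello, this is a test. a test is always good."
words = re.findall(r"[a-z]+", doc.lower())  # stand-in for blob.words

n_docs = 1  # assumed: the corpus is just this one document
df = 1      # every word of the document occurs in that one document

# Same formula as the question: log(N / (1 + df)) = log(1 / 2) < 0
idf = math.log(n_docs / (1 + df))

def tf(word):
    return words.count(word) / len(words)

score_good = round(tf("good") * idf, 5)  # word occurring once
score_test = round(tf("test") * idf, 5)  # word occurring twice
```

With N = 1 and df = 1 the ratio inside the logarithm is 0.5, so every word gets a negative score proportional to its term frequency.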
A possible fix is to check for zero explicitly:
def idf(word, bloblist):
    x = n_containing(word, bloblist)
    return math.log(len(bloblist) / (x if x else 1))
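Note that with this version, a word that appears in exactly one blob and a word that appears in no blob at all get the same value. You may find another solution that suits your needs better; just remember not to take the log of a fraction smaller than 1.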
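One way to keep those two cases apart, sketched here as an assumption rather than a recommendation, is to add 1 to both the document count and the document frequency. The helper names `idf_zero_check` and `idf_smoothed` are hypothetical, and both take plain counts instead of blobs so the comparison runs on its own.

```python
import math

def idf_zero_check(df, n_docs):
    # The explicit zero check: df == 0 is treated like df == 1.
    return math.log(n_docs / (df if df else 1))

def idf_smoothed(df, n_docs):
    # Hypothetical alternative: add 1 to numerator and denominator.
    return math.log((1 + n_docs) / (1 + df))

n_docs = 3

# df = 0 and df = 1 collapse to the same value with the zero check...
collide = idf_zero_check(0, n_docs) == idf_zero_check(1, n_docs)

# ...but stay distinct with add-one smoothing, which is also never
# negative because (1 + n_docs) >= (1 + df) whenever df <= n_docs.
distinct = idf_smoothed(0, n_docs) != idf_smoothed(1, n_docs)
non_negative = all(idf_smoothed(df, n_docs) >= 0 for df in range(n_docs + 1))
```

The trade-off is that a word present in every document now scores exactly 0 instead of a small negative number.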
Answer 1: (score: -1)
IDF scores should be non-negative. The problem lies in the implementation of the idf function.
Try this instead:
from __future__ import division, print_function
from textblob import TextBlob
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Document frequency plus 1, so the denominator is never zero.
    return 1 + sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # log((1 + N) / (1 + df)); the ratio is always >= 1, so the result is >= 0.
    return math.log(float(1 + len(bloblist)) / float(n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

text = 'tf–idf, short for term frequency–inverse document frequency'
text2 = 'is a numerical statistic that is intended to reflect how important'
text3 = 'a word is to a document in a collection or corpus'
blob = TextBlob(text)
blob2 = TextBlob(text2)
blob3 = TextBlob(text3)
bloblist = [blob, blob2, blob3]
tf_score = tf('short', blob)
idf_score = idf('short', bloblist)
tfidf_score = tfidf('short', blob, bloblist)
print(tf_score, idf_score, tfidf_score)
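As a quick sanity check that does not need TextBlob, the same log((1 + N) / (1 + df)) formula can be applied to the single sentence from the question. The regex tokenizer and the names `tokenize`, `docs`, and `scores` are assumptions made for illustration; documents are plain token lists here, not blobs.

```python
import math
import re

def tokenize(text):
    # Simplified stand-in for TextBlob's word splitting.
    return re.findall(r"[a-z]+", text.lower())

def idf(word, docs):
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + df))

def tfidf(word, doc, docs):
    return doc.count(word) / len(doc) * idf(word, docs)

docs = [tokenize("hello, this is a test. a test is always good.")]
scores = {word: tfidf(word, docs[0], docs) for word in set(docs[0])}
```

With only one document every word appears in every document, so the IDF is log(2 / 2) = 0 and all seven scores come out as zero rather than negative. Scores only become informative once the corpus contains several documents.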