from sklearn.feature_extraction.text import TfidfVectorizer
documents=["The car is driven on the road","The truck is
driven on the highway","the lorry is"]
fidf_transformer=TfidfVectorizer(smooth_idf=True,use_idf=True)
tfidf=tfidf_transformer.fit_transform(documents)
print(tfidf_transformer.vocabulary_)
print(tfidf.toarray())
{'the': 7, 'car': 0, 'on': 5, 'driven': 1, 'is': 3, 'road': 6, 'lorry': 4, 'truck': 8, 'highway': 2}
[[0.45171082 0.34353772 0. 0.26678769 0. 0.34353772 0.45171082 0.53357537 0. ]
[0. 0.34353772 0.45171082 0.26678769 0. 0.34353772 0. 0.53357537 0.45171082]
[0. 0. 0. 0.45329466 0.76749457 0. 0. 0.45329466 0. ]]
Why doesn't the word "the", which appears in all three documents, get a lower score?
Answer 0: (score 0)
tfidf = term frequency (tf) * inverse document frequency (idf)
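For reference, scikit-learn uses the raw in-document count as tf, and the idf formulas sketched below (the little idf helper is only illustrative, but the two formulas are the ones documented for TfidfTransformer; the optional l2 normalization is applied afterwards):

import math

n = 3  # number of documents in the toy corpus above

def idf(df, smooth=False):
    # smooth_idf=True:  ln((1 + n) / (1 + df)) + 1
    # smooth_idf=False: ln(n / df) + 1
    return math.log((1 + n) / (1 + df)) + 1 if smooth else math.log(n / df) + 1

print(idf(3), idf(1))              # 1.0 2.0986...  ('the' is in all 3 documents, 'car' in 1)
print(idf(3, True), idf(1, True))  # 1.0 1.6931...  (smoothing shrinks the gap between them)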
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names())  # vocabulary in column order
print(X.toarray())                     # raw term counts (tf)
print("---")
t = TfidfTransformer(use_idf=True, norm=None, smooth_idf=False)
a = t.fit_transform(X)
print(a.toarray())                     # tf * idf, no smoothing, no normalization
print("---")
print(t.idf_)                          # idf value for each term
Output:
idf(the) is low, but tf(the, doc1) = 2 is high, and that is what pushes its score up.
From the example code above:
With no norm and unsmoothed idf, idf(the) == idf(is) == 1.
But tf(the, doc1) = 2 while tf(is, doc1) = 1, which raises tfidf(the, doc1).
Similarly, idf(car) = 2.09861229 but tf(car, doc1) = 1, so tfidf(car, doc1) = 2.09861229, which is very close to tfidf(the, doc1) = 2. Smoothing the idf narrows the gap even further.
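As a quick sanity check (a sketch that reuses a and vectorizer from the snippet above), those two scores can be read straight out of the unnormalized matrix:

doc1 = a.toarray()[0]                        # first document: "The car is driven on the road"
print(doc1[vectorizer.vocabulary_['the']])   # 2.0          (tf 2 * idf 1)
print(doc1[vectorizer.vocabulary_['car']])   # ~2.09861229  (tf 1 * idf 2.09861229)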
On a large corpus the differences become much more pronounced.
Try running the code with smoothing disabled and without normalization to see the effect on this small corpus:
tfidf_transformer = TfidfVectorizer(smooth_idf=False, use_idf=True, norm=None)
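A minimal continuation of that line (assuming documents from the question is still in scope) that prints the raw, unnormalized scores:

tfidf = tfidf_transformer.fit_transform(documents)
print(tfidf_transformer.idf_)  # unsmoothed idf per term: 1.0 for 'the'/'is', ~2.0986 for terms in a single document
print(tfidf.toarray())         # raw count * idf, no l2 normalization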