Computing tf-idf with the scikit-learn feature extraction module

Time: 2018-06-13 14:28:17

Tags: python machine-learning scikit-learn tf-idf

Please read the whole post before flagging it. I have searched all over the internet trying to solve this problem.

I am simply trying to work through this machine learning tutorial and cannot reproduce its results. Moreover, I cannot reproduce mathematically the results that I am getting. Everything is clear up to the point where I try to generate the tf-idf values, as noted in the comments in the code below.

Specifically, am I generating the tf-idf correctly below? If so, how can I reproduce it mathematically? My understanding is that it should simply be tf * idf, where both are straightforward calculations, as noted in the comments below.

Thanks in advance!

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print vectorizer.vocabulary_
# Vocabulary: {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}
freq_term_matrix = vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
# [0 1 0 2]]

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
# The arguments as they are passed into TfidfTransformer:
# TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

print "IDF:", tfidf.idf_
# This is where the confusion begins. What are these numbers?
# IDF: [ 2.09861229  1.          1.40546511  1.        ]
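# Note: TfidfTransformer with smooth_idf=True (the default) does not use the
# plain log(N / df) formula; it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1,
# where n is the number of documents it was fit on (here the 2 test documents)
# and df(t) is how many of them contain term t. For example, 'blue' appears in
# 0 of the 2 documents, so idf = ln(3 / 1) + 1 = 2.09861229, and 'sky' appears
# in 1 of them, so idf = ln(3 / 2) + 1 = 1.40546511.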


tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# It's my understanding that these are simply tf * idf where 
# tf = (number of times a word appears in a doc) / (number of words in document)
# idf = log((number of documents) / (number of docs the word appears in))
# [[ 0.          0.50154891  0.70490949  0.50154891]
# [ 0.          0.4472136   0.          0.89442719]]
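
For reference, the numbers above can be reproduced by hand once the smoothed idf and the L2 row normalisation are taken into account: tf here is the raw term count (not count divided by document length), each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then divided by its Euclidean norm. A minimal NumPy sketch (the counts and document frequencies are read off the matrices printed above; variable names are just for illustration):

import numpy as np

# Raw term counts for the two test documents (freq_term_matrix above),
# with columns ordered as in the vocabulary: blue, bright, sky, sun.
counts = np.array([[0., 1., 1., 1.],
                   [0., 1., 0., 2.]])

n_docs = counts.shape[0]            # 2 documents
df = (counts > 0).sum(axis=0)       # documents containing each term: [0, 2, 1, 2]

# Smoothed idf, as used by TfidfTransformer(smooth_idf=True):
idf = np.log((1. + n_docs) / (1. + df)) + 1.
print(idf)
# [ 2.09861229  1.          1.40546511  1.        ]

# tf-idf = raw count * idf, then each row is L2-normalised (norm='l2'):
weighted = counts * idf
weighted /= np.linalg.norm(weighted, axis=1, keepdims=True)
print(weighted)
# [[ 0.          0.50154891  0.70490949  0.50154891]
#  [ 0.          0.4472136   0.          0.89442719]]

The key differences from the textbook formulas in the comments above are the raw counts used for tf, the "+1" terms in the smoothed idf, and the final L2 normalisation.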

0 Answers:

There are no answers yet.