Please read the whole post before flagging it. I have searched all over the internet trying to solve this.
I am simply trying to work through this machine learning tutorial and cannot reproduce its results. More to the point, I cannot reproduce, mathematically, the results my own code below is producing. Everything is clear up to the point where I generate the tf-idf values, as noted in the comments in the code.
Specifically: am I generating the tf-idf below correctly? And if so, how do I reproduce it by hand? My understanding is that it should simply be tf * idf, where both are straightforward calculations, as spelled out in the comments below.
Thanks in advance!
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)
# Vocabulary: {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}
freq_term_matrix = vectorizer.transform(test_set)
print(freq_term_matrix.todense())
# [[0 1 1 1]
# [0 1 0 2]]
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
# The arguments as they are passed into TfidfTransformer:
# TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
print "IDF:", tfidf.idf_
# This is where the confusion begins. What are these numbers?
# IDF: [ 2.09861229 1. 1.40546511 1. ]
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())
# It's my understanding that these are simply tf * idf where
# tf = (number of times a word appears in a doc) / (number of words in document)
# idf = log((number of documents) / (number of docs the word appears in))
# [[ 0. 0.50154891 0.70490949 0.50154891]
# [ 0. 0.4472136 0. 0.89442719]]
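To make that by-hand check concrete, here is a minimal sketch of what the formulas in my comments above actually compute, using the counts and vocabulary printed earlier (docs, df, etc. are just names I picked for the check):

import math

# Raw counts for the two test documents, copied from freq_term_matrix above
# (columns: blue, bright, sky, sun).
docs = [[0, 1, 1, 1],
        [0, 1, 0, 2]]
n_docs = len(docs)

# Document frequency: number of test documents each term appears in.
df = [sum(1 for doc in docs if doc[i] > 0) for i in range(len(docs[0]))]

for doc in docs:
    n_words = float(sum(doc))
    row = []
    for count, term_df in zip(doc, df):
        tf = count / n_words                                           # tf as described above
        idf = math.log(n_docs / float(term_df)) if term_df else 0.0    # idf as described above
        row.append(tf * idf)
    print(row)
# This prints roughly [0.0, 0.0, 0.231, 0.0] and [0.0, 0.0, 0.0, 0.0],
# which is nothing like the idf_ vector or the matrix shown above.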
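On the other hand, if I take the idf_ values printed above as given, I can reproduce the final matrix exactly, assuming (this is just my reading of norm="l2") that transform() multiplies each raw count by the corresponding idf_ entry and then divides every row by its Euclidean length:

import math

idf = [2.09861229, 1.0, 1.40546511, 1.0]   # copied from tfidf.idf_ above
docs = [[0, 1, 1, 1],                      # raw counts from freq_term_matrix
        [0, 1, 0, 2]]

for doc in docs:
    weighted = [count * w for count, w in zip(doc, idf)]    # count * idf per term
    length = math.sqrt(sum(v * v for v in weighted))        # Euclidean (L2) norm of the row
    print([v / length for v in weighted])
# The two rows printed here match tf_idf_matrix above
# ([0, 0.50154891, 0.70490949, 0.50154891] and [0, 0.4472136, 0, 0.89442719]).

So everything after idf_ seems to be just scaling and normalization; what I cannot reproduce with the simple log(N / df) formula from my comments is the idf_ vector itself.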