tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

Date: 2014-05-21 20:05:36

Tags: python scikit-learn tf-idf

This page, http://scikit-learn.org/stable/modules/feature_extraction.html, mentions:

As tf-idf is very often used for text features, there is another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
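
A minimal sketch of that equivalence, assuming a toy two-sentence corpus (not the exact code from the docs page):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

# Two-step pipeline: raw term counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
X_two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent with TfidfVectorizer
X_one_step = TfidfVectorizer().fit_transform(corpus)

# Both yield the same sparse (documents x features) tf-idf matrix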

I then followed the code and used fit_transform() on my corpus. How do I get the weight of each feature computed by fit_transform()?

I tried:

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

but this attribute is missing.

Thanks

2 answers:

Answer 0: (score: 77)

As of version 0.15, the tf-idf score of every feature can be retrieved via the idf_ attribute of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

Output:

{u'is': 1.0,
 u'nice': 1.4054651081081644,
 u'strange': 1.4054651081081644,
 u'this': 1.0,
 u'very': 1.0}
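
These values line up with scikit-learn's default smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n = 2 documents here; a quick sketch to check it by hand (assuming the default smooth_idf=True):

from math import log

n = 2  # number of documents in the corpus
for term, df in [("this", 2), ("is", 2), ("very", 2), ("nice", 1), ("strange", 1)]:
    # smooth_idf=True (the default): idf = ln((1 + n) / (1 + df)) + 1
    print("%s: %f" % (term, log((1.0 + n) / (1.0 + df)) + 1.0))
# this/is/very -> 1.000000, nice/strange -> 1.405465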

As discussed in the comments, prior to version 0.15 a workaround is to access the attribute idf_ via the supposedly hidden _tfidf attribute (an instance of TfidfTransformer) of the vectorizer:

idf = vectorizer._tfidf.idf_

which should give the same output as above.

Answer 1: (score: 1)

See also this on how to get the TF-IDF values for all the documents:

# vectorizer and the fitted tf-idf matrix X come from the example above
feature_names = vectorizer.get_feature_names()
doc = 0  # index of the document to inspect
feature_index = X[doc, :].nonzero()[1]
tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print w, s

this 0.448320873199
is 0.448320873199
very 0.448320873199
strange 0.630099344518

#and for doc=1
this 0.448320873199
is 0.448320873199
very 0.448320873199
nice 0.630099344518

I think the results are normalized per document:

>>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
0.9999999999997548
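
That is consistent with TfidfVectorizer's default norm='l2', which scales each document row to unit Euclidean length; a small self-contained sketch to verify it (re-fitting the same toy corpus):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
X = TfidfVectorizer(min_df=1).fit_transform(corpus)

# With norm='l2' (the default), each document row has unit L2 norm
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(row_norms)  # both entries are ~1.0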