Analyzing tf-idf results with Python

Time: 2017-04-20 12:16:37

Tags: python-3.x scikit-learn tf-idf

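I am trying to generate tf-idf on a plain corpus of about 200k tokens. I first produced a count vector of term frequencies. Then I generated the tf-idf matrix and got the following results. My code is: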

from sklearn.feature_extraction.text import TfidfVectorizer
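# Raw string so the backslash in the Windows path is not treated as an escape
with open(r"D:\history.txt", encoding='utf8') as infile:
    contents = infile.readlines()  # one document per line
# define vectorizer parameters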
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                 min_df=0.0,
                                 use_idf=True, ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to contents

print(tfidf_matrix)

Result:

  (0, 8371)     0.0296607326158
  (0, 27755)    0.159032195629
  (0, 59369)    0.0871403881289
   :    :
  (551, 64746)  0.0324104689629
  (551, 10118)  0.0324104689629
  (551, 9308)   0.0324104689629

However, I would like to get the results in the following form:

   (551, good ) 0.0324104689629

1 answer:

Answer 0 (score: 0)

You can use the indices in the sparse output together with TfidfVectorizer.get_feature_names() to generate the desired output:

tfidf_matrix
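
A minimal sketch of that idea, not necessarily the answerer's exact code: convert the sparse matrix to COO format, which exposes the row index, column index, and value of every nonzero entry, then look each column index up in the vectorizer's feature-name list. The three sample documents and the name `labelled` are assumptions made for illustration; in the question, `contents` would be the lines read from history.txt. Newer scikit-learn releases replace get_feature_names() with get_feature_names_out(), so the sketch tries the newer method first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the lines read from history.txt.
contents = [
    "the weather is good today",
    "good food and good company",
    "history repeats itself",
]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)

# Use get_feature_names_out() where it exists, else fall back to
# the older get_feature_names() named in the answer.
if hasattr(tfidf_vectorizer, "get_feature_names_out"):
    feature_names = tfidf_vectorizer.get_feature_names_out()
else:
    feature_names = tfidf_vectorizer.get_feature_names()

# COO format stores parallel row, col, and data arrays for the nonzero entries.
coo = tfidf_matrix.tocoo()
labelled = [
    (int(row), str(feature_names[col]), float(value))
    for row, col, value in zip(coo.row, coo.col, coo.data)
]

# Print in the "(document, term)  weight" layout asked for in the question.
for row, term, value in labelled:
    print(f"({row}, {term})\t{value}")
```

With ngram_range=(1, 3) the terms include two- and three-word phrases as well as single words, so for a large corpus it may be more practical to write the rows to a file than to print them.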