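I am trying to generate tf-idf on a plain-text corpus of about 200k tokens. I first built a count vectorizer on the term frequencies. Then I generated the tf-idf matrix and got the result below. My code is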
from sklearn.feature_extraction.text import TfidfVectorizer

# raw string so the backslash in the Windows path is taken literally
with open(r"D:\history.txt", encoding='utf8') as infile:
    contents = infile.readlines()

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                   min_df=0.0,
                                   use_idf=True, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)  # fit the vectorizer to contents
print(tfidf_matrix)
Result
(0, 8371) 0.0296607326158
(0, 27755) 0.159032195629
(0, 59369) 0.0871403881289
: :
(551, 64746) 0.0324104689629
(551, 10118) 0.0324104689629
(551, 9308) 0.0324104689629
However, I would like to get the result in the following form
(551, good ) 0.0324104689629
Answer 0 (score: 0)
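You can combine the sparse output tfidf_matrix with the indices of TfidfVectorizer.get_feature_names() to generate the desired output. Each printed entry has the form (row, column) value, and the column number is the position of the term in that list of feature names.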
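The answer's original code is not shown here, so the following is only a minimal sketch of one way to do that mapping, not necessarily what the answerer wrote. It assumes a tiny stand-in corpus (`docs`) instead of the file from the question, and `labeled_entries` is a hypothetical helper name. Note that recent scikit-learn releases provide `get_feature_names_out()` in place of `get_feature_names()`, so the sketch tries the newer method first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (an assumption); the question reads lines from a file.
docs = ["good movie", "bad movie", "good good plot"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(docs)

# Newer scikit-learn versions expose get_feature_names_out();
# older ones only have get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = vectorizer.get_feature_names_out()
else:
    feature_names = vectorizer.get_feature_names()


def labeled_entries(matrix, names):
    """Return (row, term, weight) for every stored entry of a sparse matrix."""
    coo = matrix.tocoo()  # COO format exposes parallel row/col/data arrays
    return [(int(r), str(names[c]), float(v))
            for r, c, v in zip(coo.row, coo.col, coo.data)]


entries = labeled_entries(tfidf_matrix, feature_names)
for row, term, weight in entries:
    print(f"({row}, {term}) {weight}")
```

With the matrix from the question, calling the helper on `tfidf_matrix` and the feature names of `tfidf_vectorizer` should print lines such as `(551, good) 0.0324...` instead of column numbers. For a large corpus it may be better to iterate lazily rather than build the full list.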