Question

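I am following the {{3}} that uses CountVectorizer on a dataset.

Question: count_vect.vocabulary_.viewitems() lists all the terms and their frequencies. How do you sort them by number of occurrences?

sorted( count_vect.vocabulary_.viewitems() ) does not seem to work.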

Answer 1

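vocabulary_.viewitems() does not actually list the terms and their frequencies. It lists the mapping from each term to its column index. The frequencies (per document) are returned by the fit_transform method as a sparse matrix (COO in older scikit-learn releases, CSR in current ones), where rows are documents and columns are words (the column indices are mapped to words through the vocabulary). You can get the total frequencies with: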

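import numpy as np
matrix = count_vect.fit_transform(doc_list)
# sum(axis=0) yields a 1 x n_terms matrix; flatten it to a 1-D array of totals
totals = np.asarray(matrix.sum(axis=0)).ravel()
# scikit-learn >= 1.0; older releases use get_feature_names()
freqs = zip(count_vect.get_feature_names_out(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))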
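
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The helper name term_totals and the three-document toy corpus are made up for illustration and are not part of the question. It reads the term-to-column mapping from vocabulary_ directly, so it does not depend on which get_feature_names variant your scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def term_totals(docs):
    """Return (term, total_count) pairs, most frequent first.

    Hypothetical helper for illustration; ties are broken alphabetically.
    """
    vectorizer = CountVectorizer()
    # Rows are documents, columns are terms (sparse matrix).
    matrix = vectorizer.fit_transform(docs)
    # Column sums give the corpus-wide count of each term.
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    # vocabulary_ maps term -> column index, not term -> frequency.
    pairs = [(term, int(totals[idx]))
             for term, idx in vectorizer.vocabulary_.items()]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


# Toy corpus, assumed purely for the example.
docs = ["the cat sat", "the cat ran", "the dog ran away"]
ranked = term_totals(docs)
print(ranked)
```

Sorting on the pair (-count, term) keeps the order deterministic when several terms have the same total, which a sort on the count alone does not guarantee.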

Finding the most frequent terms in a Scikit-learn classifier
