Question

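I have a document containing 5,000 reviews, stored in sample_data. I applied a TF-IDF vectorizer to sample_data with a unigram range (ngram_range=(1,1)). Now I want to get the top 1,000 words from sample_data that have the highest TF-IDF values. Can anyone tell me how to get the top words?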

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)

Answer 1

TF-IDF values depend on the individual document. By using the max_features parameter of TfidfVectorizer, you can get the top 1,000 terms based on their count (tf):

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.

Just do:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
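As a rough illustration of what max_features does, here is a minimal sketch on a toy corpus. The four short reviews are made up for this example and stand in for sample_data; max_features=3 plays the role of max_features=1000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for sample_data (hypothetical reviews, not from the question).
sample_data = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
    "movie plot plot",
]

# Keep only the 3 terms with the highest corpus-wide counts;
# with the real data this would be max_features=1000.
tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1), max_features=3)
final_tf_idf = tf_idf_vect.fit_transform(sample_data)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(tf_idf_vect.vocabulary_)
print(kept_terms)
print(final_tf_idf.shape)
```

The selection here is by raw term count across the corpus, not by tf-idf score, so the resulting matrix simply has one column per kept term.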

After fitting (learning) the documents, you can even get the 'idf' (global term weights) from tf_idf_vect through its idf_ attribute:

idf_ : array, shape = [n_features], or None
  The learned idf vector (global term weights) when use_idf is set to True,  

Do this after calling tf_idf_vect.fit(sample_data):

idf = tf_idf_vect.idf_

Then pick the top 1,000 from these and refit the data based on the selected features.
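One possible way to do this selection and refit is sketched below. It assumes "top" means the highest idf values, which favors the rarest terms; the toy corpus and the names top_k, top_terms, and vec_top are made up for the example. The refit relies on the vocabulary parameter of TfidfVectorizer, which restricts the features to a given list of terms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for sample_data (hypothetical reviews, not from the question).
sample_data = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
    "movie plot plot",
]
top_k = 3  # would be 1000 on the real data

tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1))
tf_idf_vect.fit(sample_data)

# Terms ordered by column index, so terms[i] lines up with idf_[i].
terms = np.array(sorted(tf_idf_vect.vocabulary_, key=tf_idf_vect.vocabulary_.get))

# Indices of the top_k largest idf values (assumption: higher idf = "top").
order = np.argsort(-tf_idf_vect.idf_, kind="stable")[:top_k]
top_terms = terms[order].tolist()

# Refit using only the selected terms as the vocabulary.
vec_top = TfidfVectorizer(ngram_range=(1, 1), vocabulary=top_terms)
final_tf_idf = vec_top.fit_transform(sample_data)
print(top_terms, final_tf_idf.shape)
```

Because idf is highest for terms that occur in the fewest documents, this kind of selection tends to surface rare words; sorting in the opposite direction would surface the most widespread ones instead.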

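However, you cannot get the top 1,000 by "tf-idf" directly, because tf-idf is the product of a term's tf in a single document and its idf over the (global) vocabulary. So, ignoring normalization, the tf-idf value of a word in a document where it appears twice is twice its value in another document where it appears only once. How would you compare these different values of the same term? Hope this makes it clear.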
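The point can be seen on a two-document toy corpus. This is only a sketch: the documents are made up, and norm=None is set so the raw tf * idf product is visible (the default L2 normalization would rescale each row). The last step, taking the maximum score per term across documents, is one possible aggregation chosen for illustration, not something the answer prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: "good" appears twice in the first, once in the second.
docs = ["good good movie", "good plot"]

# norm=None keeps the raw tf * idf values instead of L2-normalizing each row.
vec = TfidfVectorizer(ngram_range=(1, 1), norm=None)
X = vec.fit_transform(docs)

col = vec.vocabulary_["good"]
score_doc0 = X[0, col]  # tf = 2
score_doc1 = X[1, col]  # tf = 1
print(score_doc0, score_doc1)

# To rank terms at all, the per-document scores must be collapsed to one
# number per term. Here the maximum over documents is used (an assumption).
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
max_per_term = X.max(axis=0).toarray().ravel()
ranked = sorted(zip(terms, max_per_term), key=lambda pair: -pair[1])
print(ranked)
```

Using the mean or the sum instead of the maximum would generally give a different ranking, which is why there is no single "top 1,000 by tf-idf" without first fixing such a choice.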
