I have a large corpus (about 400k unique sentences). I just want the TF-IDF score of each word. I tried computing a score for every word by scanning the corpus and counting frequencies, but it takes too long.
I used:
X = tfidfVectorizer(corpus)
from sklearn, but it directly returns vector representations of the sentences. Is there any way to get the TF-IDF score of each word in the corpus?
Answer (score: 5)
Use sklearn.feature_extraction.text.TfidfVectorizer (example taken from the docs):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
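(Note: on scikit-learn 1.0+ the equivalent call is vectorizer.get_feature_names_out(); get_feature_names() was removed in 1.2.)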
Now, if you print X.toarray():
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
Each row in this 2-D array corresponds to one document, and each element in a row is the TF-IDF score of the corresponding word. To see which word each position represents, look at .get_feature_names(); it returns the list of words. For example, take the row for the first document:
[0., 0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524]
In this example, .get_feature_names() returns the following:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
So you can map the scores to the words like this:
{'and': 0.0, 'document': 0.46979139, 'first': 0.58028582, 'is': 0.38408524, 'one': 0.0, 'second': 0.0, 'the': 0.38408524, 'third': 0.0, 'this': 0.38408524}
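To build that mapping programmatically, zip the feature names with a row of the matrix. The question also asks for one score per word over the whole corpus; since TF-IDF is defined per word per document, collapsing it to a single number per word requires a convention. The sketch below assumes taking each word's maximum score over all documents (the mean would be another reasonable choice):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix, shape (n_docs, n_words)

# Column order of X matches this word list.
# On scikit-learn < 1.0, call vectorizer.get_feature_names() instead.
words = vectorizer.get_feature_names_out()

# {word: score} for the first document.
doc0_scores = dict(zip(words, X[0].toarray().ravel()))
print(doc0_scores)

# One score per word across the whole corpus: the max TF-IDF the word
# reaches in any document. X.max(axis=0) stays sparse, so this scales
# to 400k sentences without densifying the full matrix.
corpus_scores = dict(zip(words, X.max(axis=0).toarray().ravel()))
print(corpus_scores)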