Question

I am following the documentation here:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

Suppose I already have a term-frequency matrix like the one given by X.toarray(), but I did not use CountVectorizer to obtain it.

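I want to apply TF-IDF to this matrix. Is there a way to take the count array plus a vocabulary and use some inverse of this function as a constructor, so that I obtain the fit_transformed X?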

I am looking for something like this. Given:

>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


>>> V = CountVectorizerConstructorPrime(array=(X.toarray()), 
                                        vocabulary=['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'])

such that:

>>> V == X
True

Answer 1

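The X constructed by CountVectorizer is a sparse matrix in SciPy's compressed sparse row (csr) format. You can therefore build it directly from any word-count matrix using the appropriate SciPy function: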

from scipy.sparse import csr_matrix

V = csr_matrix(X.toarray())

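V and X are now equal, although this may not be obvious: V == X does not return a single True, it gives you another matrix (one you would expect to be sparse, although it does not come back in sparse format, see this question). You can check equality like this instead: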

(V != X).toarray().any()

False
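A self-contained sketch of the same check, for readers who want to run it end to end. It assumes the corpus from the question and compares the matrix built by hand with the one CountVectorizer produces; the variable name counts is introduced here only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Count matrix obtained "by other means" (here simply typed in by hand).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
])

X = CountVectorizer().fit_transform(corpus)  # sparse matrix from sklearn
V = csr_matrix(counts)                       # sparse matrix built directly

# Element-wise inequality: if no entry differs, the matrices are equal.
matrices_equal = not (V != X).toarray().any()

# Dense comparison as a cross-check (fine for small matrices only).
dense_equal = np.array_equal(V.toarray(), X.toarray())

print(matrices_equal, dense_equal)
```

The dense comparison is only practical for small matrices; for large ones the sparse inequality check avoids materializing the full array on both sides.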

Note that the word list is not needed, because the matrix only encodes the frequencies of the distinct words, whatever those words are.
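Since the stated goal is to apply TF-IDF to an existing count matrix, one possible route is scikit-learn's TfidfTransformer, which operates on counts rather than on raw text. The following is a minimal sketch under that assumption, using default parameters; it is not necessarily what the asker ended up doing.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix obtained without CountVectorizer (values from the question).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
])

V = csr_matrix(counts)

# TfidfTransformer takes a count matrix and returns a TF-IDF matrix;
# no vocabulary is required at this step.
tfidf = TfidfTransformer().fit_transform(V)

print(tfidf.shape)  # (4, 9)
```

The vocabulary only matters if you later want to map column indices back to words; the transformation itself depends on the counts alone.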

Reversing X.toarray() back into a CountVectorizer matrix in sklearn
