Question

I am using the HashingVectorizer class from sklearn.feature_extraction.text, but I do not understand how it works.

My code

from sklearn.feature_extraction.text import HashingVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = HashingVectorizer(n_features=2**3)
X = vectorizer.fit_transform(corpus)
print(X)

My results

(0, 0)        -0.8944271909999159
(0, 5)        0.4472135954999579
(0, 6)        0.0
(1, 0)        -0.8164965809277261
(1, 3)        0.4082482904638631
(1, 5)        0.4082482904638631
(1, 6)        0.0
(2, 4)        -0.7071067811865475
(2, 5)        0.7071067811865475
(2, 6)        0.0
(3, 0)        -0.8944271909999159
(3, 5)        0.4472135954999579
(3, 6)        0.0

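I have read a lot of articles about the "hashing trick", for example this one: https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f

I understand the article, but I cannot see how it relates to the results I got above.

Could you explain how HashingVectorizer works using a simple example?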

Answer 1

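The result is the sparse representation of a matrix of size 4x8. Each printed line has the form (row, column) value, and only the stored entries are listed. To see the full matrix, convert it to a dense array: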

print(X.toarray())

Output:

[[-0.89442719  0.          0.          0.          0.          0.4472136
   0.          0.        ]
 [-0.81649658  0.          0.          0.40824829  0.          0.40824829
   0.          0.        ]
 [ 0.          0.          0.          0.         -0.70710678  0.70710678
   0.          0.        ]
 [-0.89442719  0.          0.          0.          0.          0.4472136
   0.          0.        ]]

To get the vector for a token, we compute its hash, which gives the column index in the matrix. That column is the token's vector.
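The hash-to-column step can be sketched with the standard library alone. This is only an illustration of the idea, not scikit-learn's implementation: `hash_vectorize` is a hypothetical helper, and `zlib.crc32` is a stand-in hash (scikit-learn uses a signed 32-bit MurmurHash3), so the column positions will generally differ from the output in the question. It also skips the sign flipping and normalization that HashingVectorizer applies by default.

```python
import re
import zlib


def hash_vectorize(docs, n_features=8):
    """Toy hashing trick: count tokens per hashed column (no sign, no norm)."""
    rows = []
    for doc in docs:
        row = [0] * n_features
        # Lowercase the text and keep tokens of two or more word characters.
        for token in re.findall(r"\b\w\w+\b", doc.lower()):
            # zlib.crc32 is a stand-in hash, NOT the one scikit-learn uses.
            column = zlib.crc32(token.encode("utf-8")) % n_features
            row[column] += 1
        rows.append(row)
    return rows


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

for row in hash_vectorize(corpus):
    print(row)
```

Documents 1 and 4 contain the same tokens in a different order, so their rows are identical. The same holds for rows 0 and 3 of the matrix in the question.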

Answer 2

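I don't think the results are meaningful as printed, because of the negative values and the normalization that is applied by default.

If you do this: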

vectorizer = HashingVectorizer(n_features=2**3, norm=None, alternate_sign=False)

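you should see the raw counts, and the results should start to make sense. If you want normalized term frequencies, set norm='l2'.

The results you are printing are basically (document_id, position_in_matrix) counts.

For more information, see this post on HashingVectorizer vs. CountVectorizer.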
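As a rough check of the above, the sketch below rebuilds both the raw counts and the default (signed, L2-normalized) output by hand. It assumes that HashingVectorizer hashes each token with scikit-learn's signed 32-bit MurmurHash3 (`sklearn.utils.murmurhash3_32`, seed 0), takes the absolute value modulo `n_features` as the column, and uses the sign of the hash when `alternate_sign=True`. `manual_hashing` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.utils import murmurhash3_32

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
n_features = 2**3


def manual_hashing(docs, analyzer, n_features, signed):
    """Hypothetical helper: rebuild the hashed matrix token by token."""
    out = np.zeros((len(docs), n_features))
    for doc_id, doc in enumerate(docs):
        for token in analyzer(doc):
            h = murmurhash3_32(token, seed=0)  # signed 32-bit integer
            sign = -1.0 if (signed and h < 0) else 1.0
            out[doc_id, abs(h) % n_features] += sign
    return out


# Raw counts, as suggested in this answer.
raw = HashingVectorizer(n_features=n_features, norm=None, alternate_sign=False)
raw_counts = raw.transform(corpus).toarray()
manual_raw = manual_hashing(corpus, raw.build_analyzer(), n_features, signed=False)

# Default settings: alternating sign, then L2 normalization of each row.
default = HashingVectorizer(n_features=n_features)
default_out = default.transform(corpus).toarray()
manual_signed = manual_hashing(corpus, default.build_analyzer(), n_features, signed=True)
manual_default = manual_signed / np.linalg.norm(manual_signed, axis=1, keepdims=True)

print(raw_counts)
print(np.allclose(raw_counts, manual_raw))
print(np.allclose(default_out, manual_default))
```

The signed variant also accounts for the 0.0 entries in the question's output: when two tokens of one document land in the same column with opposite signs, they cancel, but the entry is still stored in the sparse matrix.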
