Question

I am new to scikit-learn and scipy, and I tried the following:

# -*- coding: utf-8 -*-
from sklearn.feature_extraction import FeatureHasher

# Two samples, each a list of word pairs (bigrams).
data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
        [('and', 'one'), ('one', 'more')]]

fh = FeatureHasher(input_type='string')
# Join each pair into one string so that every sample is an iterable of strings.
X = fh.transform(((' '.join(x) for x in sample) for sample in data))
print(X)

The problem is that I don't understand the output:

  (0, 18882)    1.0
  (0, 908056)   1.0
  (0, 1003453)  1.0
  (1, 433727)   1.0
  (1, 575892)   -1.0

Can someone explain what this output means? I read the documentation for FeatureHasher(), but it does not explain it.

Answer 1

This is how a large sparse matrix, as implemented in scipy.sparse, is displayed:

  (0, 18882)    1.0
  (0, 908056)   1.0
  (0, 1003453)  1.0
  (1, 433727)   1.0
  (1, 575892)   -1.0

X.shape gives its dimensions. X.todense() produces a regular numpy matrix that contains a large number of zeros.

Here is a sample of a much smaller sparse matrix:

In [182]: from scipy import sparse
In [183]: X=sparse.csr_matrix([[0,1,2],[1,0,0]])
In [184]: X
Out[184]: 
<2x3 sparse matrix of type '<type 'numpy.int32'>'
    with 3 stored elements in Compressed Sparse Row format>
In [185]: print X
  (0, 1)    1
  (0, 2)    2
  (1, 0)    1
In [186]: X.todense()
Out[186]: 
matrix([[0, 1, 2],
        [1, 0, 0]])
In [187]: X.toarray()
Out[187]: 
array([[0, 1, 2],
       [1, 0, 0]])

print X displays the nonzero values of this matrix in (row, col) value format.
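If you want those (row, col) value triples as data rather than as printed text, one option is to convert to COO format, which stores parallel row, col and data arrays. This is only a minimal sketch using the same small matrix as in the session above; the variable names coo and triples are my own.

```python
from scipy import sparse

# Same small matrix as in the IPython session above.
X = sparse.csr_matrix([[0, 1, 2], [1, 0, 0]])

# COO ("coordinate") format keeps three parallel arrays:
# row indices, column indices and the stored values.
coo = X.tocoo()

# Collect the stored entries as plain (row, col, value) tuples.
triples = [(int(r), int(c), int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]

for r, c, v in triples:
    print((r, c), v)
```

Every position that does not appear in triples is an implicit zero.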

Your X is at least a (2, 1003454) matrix, but it is mostly zeros.
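To check the actual size, you can look at X.shape directly. As far as I know, FeatureHasher defaults to n_features=2**20, so the matrix should have 1048576 columns; the column numbers in the output are the hashed positions of your strings. The -1.0 entry most likely comes from the signed hash that FeatureHasher uses to decide whether a feature counts as +1 or -1 (the alternate_sign option in recent versions), but treat that as my reading rather than something shown in the question. The sketch below assumes a recent scikit-learn; the names samples and fh_small and the value n_features=16 are chosen only for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
        [('and', 'one'), ('one', 'more')]]

# Each sample becomes a list of strings such as 'this is'.
samples = [[' '.join(pair) for pair in sample] for sample in data]

fh = FeatureHasher(input_type='string')
X = fh.transform(samples)

print(X.shape)  # (number of samples, n_features)
print(X.nnz)    # number of stored (nonzero) entries

# Illustrative only: a tiny n_features makes the dense form readable,
# at the price of a higher chance of hash collisions.
fh_small = FeatureHasher(n_features=16, input_type='string')
X_small = fh_small.transform(samples)
print(X_small.toarray())
```

With the default size, calling X.todense() would allocate about two million floats just to hold a handful of nonzero values, which is why the sparse display is used.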
