Question

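In scipy, when I multiply a slice of a sparse matrix by an array containing only zeros, the result is a matrix that is less sparse than, or only as sparse as, the original, even though it should be at least as sparse. The same happens when I set parts of the matrix to 0 or False: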

>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
        with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
        with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
        with 12 stored elements in Compressed Sparse Row format>

For large matrices this is computationally expensive, because it iterates over every cell in the slice, not just the nonzero ones.

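From a mathematical/logical point of view, when a sparse matrix or vector is multiplied, all empty cells must stay empty, since 0*x == 0. The same goes for setting to zero: cells that are already zero do not need to be set to zero explicitly.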

What is the best way to deal with this?

I am using scipy version 0.17.0.

Answer 1

When working with sparse matrices, changing the sparsity pattern is usually a very expensive operation, so scipy does not do it silently.

If you want to remove explicitly stored zeros from a sparse matrix, use the eliminate_zeros() method; for example:

>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 99740 stored elements in Compressed Sparse Row format>

>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 99740 stored elements in Compressed Sparse Row format>

>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 99657 stored elements in Compressed Sparse Row format>

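Scipy could call the eliminate_zeros routine automatically after operations like this, but the developers chose to give users more flexibility and control over when something as expensive as changing the sparsity structure is performed.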
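If the concern is touching only the stored entries of a column, one option is to edit the CSR data array directly and prune afterwards. This is a minimal sketch of my own, not something taken from the answer above: zero_column_inplace is a hypothetical helper name, and it relies on the CSR layout, where M.indices holds the column index of each stored value.

```python
import numpy as np
from scipy import sparse

def zero_column_inplace(M, j):
    """Zero column j of a CSR matrix by touching stored entries only.

    Hypothetical helper for illustration; assumes M is in CSR format.
    """
    # In CSR, M.indices[k] is the column of the stored value M.data[k].
    M.data[M.indices == j] = 0
    # The zeros are still stored at this point; drop them explicitly.
    M.eliminate_zeros()
    return M

dense = np.array([[1., 0., 2.],
                  [0., 0., 3.],
                  [4., 5., 0.]])
M = sparse.csr_matrix(dense)
nnz_before = M.nnz          # 5 stored entries
zero_column_inplace(M, 0)
nnz_after = M.nnz           # the two entries of column 0 are gone
print(nnz_before, nnz_after)
```

The mask M.indices == j still scans every stored index, so the cost grows with the number of stored entries rather than with the number of cells in the column.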

Answer 2

To recreate your code (using an int dtype for a more compact display):

In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]: 
<8x8 sparse matrix of type '<class 'numpy.int32'>'
    with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]: 
array([[0, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data       # show nonzero values by row
Out[19]: 
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
       list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)

Explicitly set a row (or column). Note the efficiency warning:

In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]: 
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
       list([1, 1]), list([]), list([1, 1]), list([1, 1]),
       list([1, 1, 1, 1, 1, 1])], dtype=object)

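So yes, it has set every value in the row to the specified value, and it does not try to distinguish between setting 0 and setting 1. You can look at the code used in M.__setitem__ and M._set_many (that is where the efficiency warning is generated).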
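To check whether an assignment triggered the efficiency warning in your own code, the warning can be captured with the standard warnings module. This is a sketch under the assumption that your scipy version exposes scipy.sparse.SparseEfficiencyWarning; whether the warning fires, and how many entries are stored right after the assignment, may differ between versions, so the snippet only prints those values.

```python
import warnings
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 0, 1, 0],
                                [1, 0, 0, 1],
                                [0, 0, 0, 0]]))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # record every warning, including repeats
    M[0, :] = 0                       # row assignment on a CSR matrix

efficiency_warnings = [w for w in caught
                       if issubclass(w.category, sparse.SparseEfficiencyWarning)]
print("efficiency warnings seen:", len(efficiency_warnings))
print("stored entries after assignment:", M.nnz)

M.eliminate_zeros()                   # drop any zeros that were stored
print("stored entries after pruning:", M.nnz)
```

Replacing "always" with "error" turns the warning into an exception, which can help locate such assignments in a larger code base.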

As @jakevpd showed, you need to tell it explicitly to eliminate the extra 0s. It does not try to do that during assignment.

In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]: 
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
       list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)

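Explicitly setting the values of a sparse matrix is generally discouraged, especially for csr. coo does not even allow it. If you need to do it, lil is the recommended format.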

In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]: 
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
       list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)

lil does eliminate the 0s.
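Put together, a round trip through lil might look like the sketch below. The small matrix is made up for illustration; the conversion itself has a cost, so this is mainly worthwhile when several assignments are batched before converting back.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1, 0, 1],
                                [0, 1, 0],
                                [1, 1, 1]]))

Ml = M.tolil()        # lil is the format suited to assignment
Ml[2, :] = 0          # zeroing a row removes its entries instead of storing zeros
result = Ml.tocsr()   # convert back for arithmetic

print(M.nnz, Ml.nnz, result.nnz)
```

M itself is left unchanged, because tolil() returns a new matrix.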

Multiplying a row by an array of 0s does not change the sparsity. It also does not act in place; it produces a new matrix:

In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]: 
<1x8 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)

Multiplying by a sparse matrix does eliminate the 0s (again, not in place):

In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]: 
<1x8 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>

(Also note the difference in format between Out[29] and Out[32].)
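The two multiplications can be compared side by side in a script. This is a sketch with a made-up matrix; the type and stored-entry count returned for the dense operand may vary between scipy versions, so the result is wrapped in csr_matrix and pruned explicitly rather than relied upon.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1, 0, 0, 1],
                                [0, 1, 0, 0]]))
row = M[0, :]                      # 1x4 sparse row with 2 stored entries

# Dense operand: wrap in csr_matrix so the following lines work
# regardless of the type that multiply() returns.
dense_product = sparse.csr_matrix(row.multiply(np.zeros((1, 4))))

# Sparse operand: the element-wise product keeps only nonzero results.
sparse_product = row.multiply(sparse.csr_matrix(np.zeros((1, 4))))

print("dense operand, stored entries:", dense_product.nnz)
print("sparse operand, stored entries:", sparse_product.nnz)

dense_product.eliminate_zeros()    # prune whatever zeros were stored
print("original row unchanged:", row.toarray())
```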

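As a general rule, multiplication (both element-wise and matrix) is the most efficient operation with csr matrices, especially when other is also sparse. In fact, row/column sums are performed with matrix multiplication, and so is advanced indexing.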
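The connection between sums and matrix multiplication can be illustrated as follows. This sketch only shows that multiplying by a vector of ones gives the same numbers as M.sum(); it does not reproduce scipy's internal implementation.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 0, 3],
                                [4, 5, 0]]))

# Row sums: multiply by a column of ones.
row_sums = M.dot(np.ones(M.shape[1], dtype=M.dtype))
# Column sums: the same trick on the transpose.
col_sums = M.T.dot(np.ones(M.shape[0], dtype=M.dtype))

# Reference values from the built-in sum method.
row_sums_ref = np.asarray(M.sum(axis=1)).ravel()
col_sums_ref = np.asarray(M.sum(axis=0)).ravel()

print(row_sums, col_sums)
```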

Multiplying a scipy sparse matrix without changing its sparsity
