对于稀疏的scipy矩阵,将行设置为零的最有效方法是什么?

时间:2013-11-05 08:48:39

标签: python matlab numpy scipy sparse-matrix

我正在尝试将以下MATLAB代码转换为Python,并且无法找到在任何合理时间内都能正常工作的解决方案。

M = diag(sum(a)) - a;
where = vertcat(in, out);
M(where,:) = 0;
M(where,where) = 1;

这里,a是稀疏矩阵,其中是向量(如进/出)。我使用Python的解决方案是:

M = scipy.sparse.diags([degs], [0]) - A
where = numpy.hstack((inVs, outVs)).astype(int)
M = scipy.sparse.lil_matrix(M)
M[where, :] = 0  # This is the slowest line
M[where, where] = 1
M = scipy.sparse.csc_matrix(M)

但由于A是334863x334863,这需要三分钟。如果有人对如何加快速度有任何建议,请提供帮助!为了进行比较,MATLAB在不知不觉中快速完成了同样的步骤。

谢谢!

3 个答案:

答案 0 :(得分:8)

对alko / seberg的方法略有不同。我发现循环令人不安,所以我花了今天早上的大部分时间想办法摆脱它。以下并不总是比其他方法更快。它表现得越好,越多的行被归零,矩阵越稀疏:

def csr_zero_rows(csr, rows_to_zero):
    rows, cols = csr.shape
    mask = np.ones((rows,), dtype=np.bool)
    mask[rows_to_zero] = False
    nnz_per_row = np.diff(csr.indptr)

    mask = np.repeat(mask, nnz_per_row)
    nnz_per_row[rows_to_zero] = 0
    csr.data = csr.data[mask]
    csr.indices = csr.indices[mask]
    csr.indptr[1:] = np.cumsum(nnz_per_row)

试驾两种方法:

rows, cols = 334863, 334863
a = sps.rand(rows, cols, density=0.00001, format='csr')
b = a.copy()
rows_to_zero = np.random.choice(np.arange(rows), size=10000, replace=False)

In [117]: a
Out[117]: 
<334863x334863 sparse matrix of type '<type 'numpy.float64'>'
    with 1121332 stored elements in Compressed Sparse Row format>

In [118]: %timeit -n1 -r1 csr_rows_set_nz_to_val(a, rows_to_zero)
1 loops, best of 1: 75.8 ms per loop

In [119]: %timeit -n1 -r1 csr_zero_rows(b, rows_to_zero)
1 loops, best of 1: 32.5 ms per loop

当然:

np.allclose(a.data, b.data)
Out[122]: True

np.allclose(a.indices, b.indices)
Out[123]: True

np.allclose(a.indptr, b.indptr)
Out[124]: True

答案 1 :(得分:7)

我用于类似任务的解决方案attributes to @seberg并且不会转换为lil格式:

import scipy.sparse
import numpy
import time

def csr_row_set_nz_to_val(csr, row, value=0):
    """Set all nonzero elements (elements currently in the sparsity pattern)
    to the given value. Useful to set to 0 mostly.
    """
    if not isinstance(csr, scipy.sparse.csr_matrix):
        raise ValueError('Matrix given must be of CSR format.')
    csr.data[csr.indptr[row]:csr.indptr[row+1]] = value

def csr_rows_set_nz_to_val(csr, rows, value=0):
    for row in rows:
        csr_row_set_nz_to_val(csr, row)
    if value == 0:
        csr.eliminate_zeros()

用时间包装您的评估

def evaluate(size):
    degs = [1]*size
    inVs = list(xrange(1, size, size/25))
    outVs = list(xrange(5, size, size/25))
    where = numpy.hstack((inVs, outVs)).astype(int)
    start_time = time.time()
    A = scipy.sparse.csc_matrix((size, size))
    M = scipy.sparse.diags([degs], [0]) - A
    csr_rows_set_nz_to_val(M, where)
    return time.time()-start_time

并测试其表现:

>>> print 'elapsed %.5f seconds' % evaluate(334863)
elapsed 0.53054 seconds

答案 2 :(得分:0)

如果您不喜欢在稀疏矩阵的内部进行挖掘,您也可以使用对角矩阵的稀疏矩阵乘法:

def zero_rows(M, rows_to_zero):

    ixs = numpy.ones(M.shape[0], int)
    ixs[rows_to_zero] = 0
    D = sparse.diags(ixs)
    res = D * M
    return res