Question

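I want to repeat the rows of a scipy csr sparse matrix, but when I call numpy's repeat method, it simply treats the sparse matrix as an object and repeats it as an object ndarray. I looked through the documentation, but I could not find any utility for repeating the rows of a scipy csr sparse matrix.

I wrote the following code, which operates on the internal data and seems to work.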

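import numpy as np
from scipy import sparse

def csr_repeat(csr, repeats):
    # Accept a single count (applied to every row) or one count per row.
    if isinstance(repeats, (int, np.integer)):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)   # stored entries per original row
    ndata = rnnz.dot(repeats)    # stored entries in the result
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    # indmap holds index increments; its cumulative sum maps every output
    # entry to a position in csr.data / csr.indices.
    indmap = np.ones(ndata, dtype=np.intp)  # np.int was removed in NumPy 1.24
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    # At each output-row boundary, rewind by the length of the row just written.
    indmap -= np.bincount(indptr_[mask],
                          weights=rnnz_[mask],
                          minlength=ndata).astype(np.intp)
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    # Undo the rewind at boundaries where the source row changes.
    indmap += np.bincount(jumps[mask],
                          weights=rnnz[mask],
                          minlength=ndata).astype(np.intp)
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))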

It is also fairly efficient, but I would rather not monkey-patch the class. Is there a better way?
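The index map built by the function above can also be expressed with np.repeat alone. The following is only a sketch of that idea, not a scipy utility: the name csr_repeat_rows is hypothetical, and it assumes the input has a well-formed, non-decreasing indptr.

```python
import numpy as np
from scipy import sparse


def csr_repeat_rows(csr, repeats):
    """Repeat row i of a CSR matrix repeats[i] times (hypothetical helper)."""
    n_rows, n_cols = csr.shape
    repeats = np.broadcast_to(np.asarray(repeats), (n_rows,))
    # Source row for every output row, e.g. repeats [1, 2] -> [0, 1, 1].
    src = np.repeat(np.arange(n_rows), repeats)
    # Stored entries per output row, and the new row pointer.
    counts = np.diff(csr.indptr)[src]
    indptr = np.concatenate(([0], np.cumsum(counts)))
    # Output position p in row i reads from csr.indptr[src[i]] + (p - indptr[i]).
    offsets = np.repeat(csr.indptr[src] - indptr[:-1], counts)
    indmap = offsets + np.arange(indptr[-1])
    return sparse.csr_matrix(
        (csr.data[indmap], csr.indices[indmap], indptr),
        shape=(len(src), n_cols),
    )


A = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 0], [4, 5, 6]]))
out = csr_repeat_rows(A, [1, 2, 3])
print(out.shape)
```

Like the original function, this touches the csr internals directly, so it is worth checking the result against np.repeat on a dense copy before relying on it.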

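Edit

As I revisit this question, I wonder why I posted it in the first place. Almost everything I wanted to do with the repeated matrix would be easier to do with the original matrix, applying the repetition afterwards. My assumption is that repeating after the fact will always be a better way to approach this problem than any of the potential answers.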
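A matrix-vector product illustrates the point of the edit: multiplying the original matrix and then repeating the dense result gives the same answer as multiplying the repeated matrix. This is a small sketch with made-up data; none of the names come from the question.

```python
import numpy as np
from scipy import sparse

# Illustrative data only.
orig = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                   [0.0, 0.0, 3.0]]))
repeats = np.array([2, 3])
v = np.array([1.0, 10.0, 100.0])

# Operate on the small matrix first, then repeat the (dense) result.
cheap = np.repeat(orig @ v, repeats)

# Reference: materialise the repeated matrix, then operate on it.
row_index = np.repeat(np.arange(orig.shape[0]), repeats)
expensive = orig[row_index, :] @ v

print(np.allclose(cheap, expensive))
```

The first form never builds the larger sparse matrix, which is why repeating afterwards tends to be the cheaper option for row-wise operations like this one.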

Answer 1

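import numpy as np
from scipy.sparse import csr_matrix
# sparse_row is a 1 x n sparse matrix; a (repeat_number x 1) column of ones times it stacks repeat_number copies.
repeated_row_matrix = csr_matrix(np.ones([repeat_number, 1])) @ sparse_row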
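The answer leaves sparse_row and repeat_number undefined. A self-contained version, with illustrative values chosen here rather than taken from the answer, might look like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative inputs; the answer does not define them.
sparse_row = csr_matrix(np.array([[0.0, 3.0, 0.0, 4.0]]))  # shape (1, 4)
repeat_number = 3

# (repeat_number x 1) column of ones times (1 x n) row -> (repeat_number x n).
ones_column = csr_matrix(np.ones((repeat_number, 1)))
repeated_row_matrix = ones_column @ sparse_row

print(repeated_row_matrix.toarray())
```

The shapes only line up when sparse_row has exactly one row, so this trick covers repeating a single sparse row rather than a whole matrix.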

Answer 2

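It is no surprise that np.repeat does not work. It delegates the operation to the hardcoded a.repeat method and, failing that, first converts a to an array (an object array if necessary).

In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays before the sparse matrix was created. The focus was on efficient math operations, not on adding, deleting, or indexing rows and elements.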
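As a sketch of what assembling on the row, col, data arrays could look like for this problem, the triplets can be repeated before any sparse matrix is built. The data and the uniform repeat count k below are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse

# Illustrative triplets for a 3 x 3 matrix; not taken from the question.
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 2, 1])
data = np.array([1, 2, 3, 5])
n_rows, n_cols = 3, 3
k = 2  # repeat every row k times (assumed uniform for simplicity)

# Row r becomes rows r*k, r*k + 1, ..., r*k + k - 1, so each triplet is copied k times.
new_row = (row[:, None] * k + np.arange(k)).ravel()
new_col = np.repeat(col, k)
new_data = np.repeat(data, k)

repeated = sparse.coo_matrix(
    (new_data, (new_row, new_col)), shape=(n_rows * k, n_cols)
).tocsr()

print(repeated.toarray())
```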

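I have not worked through your code, but I am not surprised that a csr format matrix requires that much work.

I worked out a similar function for the lil format (starting from lil.copy):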

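from scipy import sparse

def lil_repeat(S, repeat):
    # Row repeat for a lil sparse matrix; other formats are converted first.
    S = S.tolil()
    shape = list(S.shape)
    if isinstance(repeat, int):
        shape[0] = shape[0] * repeat
    else:
        shape[0] = sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    # data and rows are 1-D object arrays holding one Python list per row,
    # so a flat repeat is enough. Repeated rows share the same list objects.
    new.data = S.data.repeat(repeat)
    new.rows = S.rows.repeat(repeat)
    return new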

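But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus: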

S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print(S.toarray().repeat([1,2,3], axis=0))
print(S.toarray()[[0,1,1,2,2,2],:])
print(lil_repeat(S,[1,2,3]).toarray())
print(S[[0,1,1,2,2,2],:].toarray())

all give the same result.

And best of all?

print(S[np.arange(3).repeat([1,2,3]),:].toarray())
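The indexing approach applies to csr matrices in the same way. Wrapped in a small function (the name repeat_rows_by_index is hypothetical, not part of scipy), a sketch might be:

```python
import numpy as np
from scipy import sparse


def repeat_rows_by_index(S, repeats):
    """Repeat rows of a sparse matrix via fancy indexing (hypothetical helper)."""
    row_index = np.repeat(np.arange(S.shape[0]), repeats)
    return S[row_index, :]


S = sparse.csr_matrix([[0, 1, 2], [0, 0, 0], [1, 0, 0]])
R = repeat_rows_by_index(S, [1, 2, 3])
print(type(R).__name__, R.shape)
```

This relies only on public indexing, so it avoids touching indptr, indices, or data directly.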

Answer 3

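Someone posted a really clever response on how best to do this, so I revisited my original question to see whether there was an even better way. I came up with one more approach that has some pros and cons. Instead of repeating all of the data (as the accepted answer does), we can instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array. This can be done by simply tiling the indptr field.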

repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))

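This technique repeats the vector repeat_num times while only modifying indptr. The downside is that, because of the way csr matrices encode data, it does not create a matrix of dimension repeat_num x n; it creates one of dimension (2 * repeat_num - 1) x n in which every odd row is all zeros. This should not be too big a deal, since any operation on those rows will be quick given that each of them is 0, and they should be easy to slice out afterwards (with something like [::2]), but it's not ideal.

I think the accepted answer is probably still the "best" way to do this.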

Answer 4

One of the most efficient ways to repeat a sparse matrix is the way the OP suggested. I modified indptr so that it does not produce rows of zeros.

import numpy as np
import scipy.sparse

## original sparse matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])

To repeat it, you need to repeat the data and the indices, and you need to fix up indptr. This is not the most elegant way, but it does work.

## repeated sparse matrix
repeat = 5 
new_indptr = indptr
for r in range(1,repeat):
    new_indptr = np.concatenate((new_indptr, new_indptr[-1]+indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
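The loop that extends new_indptr can be replaced by one broadcasted addition, and scipy.sparse.vstack gives the same stacked matrix without editing indptr by hand. The sketch below reuses the example data from this answer; the variable names shifted, tiled, and stacked are introduced here for illustration.

```python
import numpy as np
import scipy.sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
repeat = 5

# Loop-free row pointer: copy i of the matrix is shifted by i * nnz entries.
nnz = indptr[-1]
shifted = indptr[1:] + nnz * np.arange(repeat)[:, None]  # shape (repeat, n_rows)
new_indptr = np.concatenate(([0], shifted.ravel()))
tiled = scipy.sparse.csr_matrix(
    (np.tile(data, repeat), np.tile(indices, repeat), new_indptr),
    shape=(repeat * x.shape[0], x.shape[1]),
)

# Same result from the library routine, without touching indptr by hand.
stacked = scipy.sparse.vstack([x] * repeat, format="csr")

print(np.array_equal(tiled.toarray(), stacked.toarray()))
```

Note that both forms stack whole copies of the matrix (rows 0, 1, 2, 0, 1, 2, and so on), which matches np.tile rather than np.repeat along axis 0.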

Repeat a scipy csr sparse matrix along axis 0
