Question

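I have a sparse matrix with M rows and N columns, and I want to concatenate K additional NULL columns to it, so that my object has M rows and (N + K) columns. The tricky part is that I also have a list of length N, with values from 0 to N + K - 1, that gives the position each column should have in the new matrix.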

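So for example, if N = 2, K = 1 and the list of indices is [2, 0], it means that I want to take the last column of my MxN matrix as the first column, introduce a null column, and then put my first column last.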

I am trying to use the following code (in my real case I already have x, but I can't upload it here).

import numpy as np
from scipy import sparse
M = 5000
N = 10
pad_factor = 1.2
size = int(pad_factor * N)
x = sparse.random(m = M, n = N, density = 0.1, dtype = 'float64')
indeces = np.random.choice(range(size), size=N, replace=False)
null_mat = sparse.lil_matrix((M, size))
null_mat[:, indeces] = x

The problem is that for N = 1,500,000, P = 5,000 and K = 200, this code does not scale and gives me a memory error. The exact error is: "return np.zeros(self.shape, dtype=self.dtype, order=order) MemoryError".

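I just want to add some null columns, so I guess my slicing idea is inefficient, especially since K << N in my real data. In a way, we can think of this as a merge sort problem: I have a non-null and a null dataset, and I want to concatenate them, but in a specific order. Any ideas on how to make this work?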

Thanks!

Answer 1

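As I deduced in the comments, the memory error is produced in the line

null_mat[:, indeces] = x

because the lil __setitem__ method calls x.toarray(), that is, it first converts x to a dense array. Mapping the sparse matrix directly onto the indexed lil might be more space efficient, but it would take a lot more work to code. lil is optimized for iterative assignment, not for this kind of large-scale matrix mapping.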

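sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo, and then combines their attributes into a new set, from which it builds the new matrix.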
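As a small illustration of that hstack behaviour, here is a minimal sketch that appends K empty columns to a toy matrix. The sizes and values are made up for the example. Note that hstack can only put the new columns at the end, so it does not solve the ordering part of the question by itself.

```python
import numpy as np
from scipy import sparse

# Toy sizes, chosen only for illustration.
M, N, K = 3, 2, 1

# A small coo matrix built from explicit (data, (row, col)) arrays.
row = np.array([0, 1, 2])
col = np.array([0, 1, 0])
data = np.array([1.0, 2.0, 3.0])
x = sparse.coo_matrix((data, (row, col)), shape=(M, N))

# An all-zero sparse block stores no elements, so it is cheap to create.
empty = sparse.coo_matrix((M, K), dtype=x.dtype)

# hstack joins the blocks side by side; the null columns end up last.
padded = sparse.hstack([x, empty], format="csr")

print(padded.shape)
print(padded.nnz)
```

The number of stored elements is unchanged; only the shape grows.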

Direct coo matrix construction

After quite a bit of playing around, I found that the following simple operation works:

In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))

In [480]: z1
Out[480]: 
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in COOrdinate format>

Compare this with x and null_mat:

In [481]: x
Out[481]: 
<5000x10 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]: 
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
    with 5000 stored elements in LInked List format>

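Testing sparse matrices for equality can be tricky. In particular, coo values can occur in any order, as in the x produced by sparse.random.

But the csr format sorts by row, so this comparison of the indptr attribute is a reasonably good first check. Note that indptr only records how many elements are stored in each row, so it cannot detect a value that landed in the wrong column; the dense comparison further down is stricter: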

In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
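For a stricter test that stays sparse, one option is to count the positions where the two matrices differ. The sketch below uses a hypothetical helper, sparse_equal, on toy data; it is not part of scipy, just a thin wrapper around the != operator on csr matrices.

```python
import numpy as np
from scipy import sparse

def sparse_equal(p, q):
    """Hypothetical helper: True if p and q have the same shape and values."""
    if p.shape != q.shape:
        return False
    # `!=` on two csr matrices returns a sparse boolean matrix that stores
    # True only where the entries differ.
    return (p.tocsr() != q.tocsr()).nnz == 0

row = np.array([0, 1, 2, 2])
col = np.array([1, 0, 2, 0])
data = np.array([1.0, 2.0, 3.0, 4.0])
a = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

# Same entries as `a`, listed in a different order.
perm = np.array([3, 1, 0, 2])
b = sparse.coo_matrix((data[perm], (row[perm], col[perm])), shape=(3, 3))

# Same number of entries per row as `a`, but one value sits in another column.
c = sparse.coo_matrix((data, (row, np.array([2, 0, 2, 0]))), shape=(3, 3))

same_indptr = np.array_equal(a.tocsr().indptr, c.tocsr().indptr)
print(sparse_equal(a, b), sparse_equal(a, c), same_indptr)
```

Here a and c share the same indptr even though they are different matrices, which is why the indptr comparison is only a first check.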

Timing tests:

In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [478]: 
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
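The direct construction can be wrapped in a small function. This is only a sketch: the name insert_null_columns and its arguments are my own, and it assumes the position list has exactly one entry per column of x. It is checked against the N = 2, K = 1, [2, 0] example from the question.

```python
import numpy as np
from scipy import sparse

def insert_null_columns(x, new_positions, total_cols):
    """Hypothetical helper: move column j of x to column new_positions[j]
    of a wider matrix; every other column stays empty."""
    x = x.tocoo()
    new_positions = np.asarray(new_positions)
    if new_positions.shape != (x.shape[1],):
        raise ValueError("need one target position per column of x")
    # Remap the column index of each stored element; data and rows are reused.
    new_col = new_positions[x.col]
    return sparse.coo_matrix((x.data, (x.row, new_col)),
                             shape=(x.shape[0], total_cols))

# Example from the question: N = 2, K = 1, index list [2, 0].
x_small = sparse.csr_matrix(np.array([[1.0, 2.0],
                                      [3.0, 4.0]]))
z_small = insert_null_columns(x_small, [2, 0], 3)
print(z_small.toarray())
```

No dense array is created at any point, so the memory use scales with the number of stored elements rather than with M * (N + K).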

Matrix multiplication approach

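csr format indexing with lists is done with matrix multiplication: it constructs an extractor matrix and applies it. Matrix multiplication is a strong point of csr_matrix.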

We can perform the reordering in the same way:

In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]: 
<10x12 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [496]: w1=x*I

Compare the dense equivalents of these matrices:

In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True


In [499]: %%timeit
     ...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
     ...: 12))
     ...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

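This is better than the lil indexing approach, but still much slower than the direct coo matrix construction. To be fair, though, we should compare against constructing a csr matrix from the coo style inputs. That conversion takes some time: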

In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
     ...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
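For reference, the multiplication approach can be written as a self-contained function. This is a sketch with names of my own choosing (reorder_by_matmul, placement); it uses the @ operator rather than *, and it is checked on the [2, 0] example from the question.

```python
import numpy as np
from scipy import sparse

def reorder_by_matmul(x, new_positions, total_cols):
    """Hypothetical helper: place column j of x at new_positions[j] by
    multiplying with an (N x total_cols) placement matrix."""
    n = x.shape[1]
    new_positions = np.asarray(new_positions)
    # Row j of the placement matrix has a single 1 in column new_positions[j].
    placement = sparse.csr_matrix(
        (np.ones(n, dtype=x.dtype), (np.arange(n), new_positions)),
        shape=(n, total_cols))
    return x.tocsr() @ placement

x_small = sparse.csr_matrix(np.array([[1.0, 2.0],
                                      [3.0, 4.0]]))
w_small = reorder_by_matmul(x_small, [2, 0], 3)
print(w_small.toarray())
```

The result is already in csr format, which may be convenient if the next step is row slicing or further products.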

Error traceback

The MemoryError traceback should have shown that the error occurred in this indexing assignment, and that the relevant method calls are:

Signature: null_mat.__setitem__(index, x)
Source:   
    def __setitem__(self, index, x):
       ....
       if isspmatrix(x):
           x = x.toarray()
       ...

Signature: x.toarray(order=None, out=None)
Source:   
    def toarray(self, order=None, out=None):
        """See the docstring for `spmatrix.toarray`."""
        B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:   
    def _process_toarray_args(self, order, out):
        ...
        return np.zeros(self.shape, dtype=self.dtype, order=order)

I found this by doing a code search for the np.zeros call on the scipy github.
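To see why that np.zeros call fails, it helps to estimate the size of the dense array it tries to allocate. The sketch below assumes the two sizes quoted in the question (1,500,000 and 5,000) are the dimensions of x; the product is the same whichever one is the row count. dense_bytes is a hypothetical helper, not a scipy function.

```python
import math
import numpy as np

def dense_bytes(shape, dtype="float64"):
    """Bytes needed for a dense array of the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

# Assumed dimensions, taken from the sizes quoted in the question.
big = dense_bytes((1_500_000, 5_000))
print(big / 1e9, "GB")

# The toy example from the question (M = 5000, N = 10) is tiny by comparison.
print(dense_bytes((5000, 10)) / 1e6, "MB")
```

Under these assumptions the dense copy of x alone needs about 60 GB, regardless of how few elements are actually stored.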

Inserting null columns into a scipy sparse matrix in a specific order
