Question

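I have about 10,000 sparse matrices, each of size 50,000x5 with an average density of 0.0004. In each iteration of a loop (10,000 iterations) I compute a numpy array, convert it to a csr_matrix, and append it to a list. However, memory consumption is as high as if I were appending the numpy arrays rather than the csr_matrices.

How can I reduce memory consumption while keeping these 10K sparse matrices in memory for further computation?

Sample code: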

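from __future__ import print_function  # lets the print call below run on Python 2 and 3

from scipy.sparse import csr_matrix
import numpy as np
sparse_matrices = []

for i in range(10000):
    np_array = get_np_array()  # my own function (not shown); returns a dense array of roughly 50,000x5
    sparse_matrix = csr_matrix(np_array)
    sparse_matrices.append(sparse_matrix)
    # dense size in bytes, size of the stored values in bytes, matrix summary
    print(np_array.nbytes, sparse_matrix.data.nbytes, repr(sparse_matrix))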

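This prints something like the following, which shows that I am appending compressed matrices. Even so, memory grows as if I were appending the numpy arrays.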

1987520 520 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
    with 65 stored elements in Compressed Sparse Row format>
1987520 512 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
    with 64 stored elements in Compressed Sparse Row format>

Just realized that if I use coo_matrix instead of csr_matrix, memory consumption is reasonable. With csr_matrix, memory use is about 8 GB.

Answer 1

For the matrix:

<49688x5 sparse matrix of type '<type 'numpy.float64'>'
with 65 stored elements in Compressed Sparse Row format>

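In coo format, the key attributes are row, col, and data, each with 65 elements. data holds floats; the other two hold integers (row and column indices).

In csr format, the row attribute is replaced by indptr, which has one value per row plus one. With this shape, indptr has 49689 elements. In csc format, indptr would have only 6 elements (one per column plus one).

csr is usually more compact than coo. But in your case there are many empty rows, so it is much larger: indptr alone is far longer than the 65 stored values. If it were a single-row matrix, csr would be especially compact; for a column vector it is not compact at all.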
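To check these attribute lengths yourself, here is a minimal sketch. The matrix is a hypothetical stand-in for the one in the question (same shape and number of stored values, random contents); the names rng, rows, cols, and vals are illustrative only.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical stand-in data: 65 stored values in a 49688x5 matrix.
rng = np.random.RandomState(0)
rows = rng.choice(49688, size=65, replace=False)  # distinct rows, so no duplicate entries
cols = rng.randint(0, 5, size=65)
vals = rng.rand(65) + 1.0                         # strictly nonzero values

coo = coo_matrix((vals, (rows, cols)), shape=(49688, 5))
csr = coo.tocsr()
csc = coo.tocsc()

# coo stores one row index, one column index, and one value per element.
print(len(coo.row), len(coo.col), len(coo.data))        # 65 65 65
# csr keeps per-element column indices but one indptr entry per row, plus one.
print(len(csr.indptr), len(csr.indices), len(csr.data))  # 49689 65 65
# csc keeps per-element row indices but one indptr entry per column, plus one.
print(len(csc.indptr), len(csc.indices), len(csc.data))  # 6 65 65
```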
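The size difference can be measured by summing the bytes of the arrays each format stores. The sketch below uses a hypothetical stand-in matrix with the same shape and number of stored values as in the question; sparse_nbytes is a helper written for this illustration, not a scipy function, and the exact byte counts depend on the index dtype scipy chooses.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_nbytes(m):
    """Bytes held by the index and value arrays of a coo, csr, or csc matrix."""
    if m.format == "coo":
        return m.data.nbytes + m.row.nbytes + m.col.nbytes
    # csr and csc both store data, indices, and indptr
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Hypothetical stand-in data: 65 stored values in a 49688x5 matrix.
rng = np.random.RandomState(0)
rows = rng.choice(49688, size=65, replace=False)
cols = rng.randint(0, 5, size=65)
vals = rng.rand(65) + 1.0

coo = coo_matrix((vals, (rows, cols)), shape=(49688, 5))
sizes = {
    "coo": sparse_nbytes(coo),
    "csr": sparse_nbytes(coo.tocsr()),
    "csc": sparse_nbytes(coo.tocsc()),
}

for fmt in ("coo", "csr", "csc"):
    print(fmt, sizes[fmt], "bytes")
```

For tall, narrow matrices like these, csr comes out far larger than coo or csc because of its long indptr, so storing each matrix as coo or csc in the list should keep memory low. Either can still be converted with tocsr() for an individual matrix when a computation needs row-oriented access.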
