Question

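I am creating a co-occurrence matrix of integers, 1M by 1M in size. Once the matrix is built, the only operation I will perform on it is getting the top N values for each row (or column, since the matrix is symmetric).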

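I have to create the matrix as sparse so that it fits in memory. I read the input data from a large file and incrementally update the co-occurrence count for each pair of indices (row, col).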

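The example code for the sparse dok_matrix says that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it may be smaller than that. Do I have to specify the size beforehand, or can I create the matrix incrementally?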

import numpy as np
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update element

Answer 1

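A question from a few days ago, creating sparse matrix of unknown size, discussed building a sparse matrix from data read from a file. The OP there wanted to use the lil format; I recommended building the input arrays for the coo format.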
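A minimal sketch of that coo route, assuming the file has already been parsed into (row, col) pairs. The function name `build_cooccurrence` and the sample pairs are made up for illustration. It relies on the fact that converting coo to csr sums duplicate entries, so each repeated pair simply adds 1 to its count. It also assumes at least one pair is supplied.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_cooccurrence(pairs):
    """Build a symmetric co-occurrence matrix from an iterable of (i, j) pairs."""
    rows, cols = [], []
    for i, j in pairs:
        rows.append(i)
        cols.append(j)
        if i != j:
            # Mirror the entry so the result is symmetric.
            rows.append(j)
            cols.append(i)
    data = np.ones(len(rows), dtype=np.int32)
    # The shape is only decided here, after all the data has been read.
    n = max(max(rows), max(cols)) + 1
    M = coo_matrix((data, (rows, cols)), shape=(n, n))
    # tocsr() sums entries that share the same (row, col).
    return M.tocsr()

pairs = [(0, 1), (0, 1), (2, 3), (1, 2)]  # hypothetical input
C = build_cooccurrence(pairs)
```

Because plain Python lists grow cheaply, the size never has to be declared while reading the file.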

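In other SO questions I have found that adding values to a plain dictionary is faster than adding them to a dok matrix, even though dok is a dictionary subclass. There is quite a bit of overhead in the dok indexing method. In some cases I have suggested building a dict with tuple keys and then using update to add the values to an already defined dok. But I suspect the coo route is better in your case.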
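As a rough illustration of the plain-dictionary approach, here is a sketch that accumulates counts in a `collections.Counter` keyed by (row, col) tuples and converts to a sparse matrix only once at the end. It goes through coo rather than updating a dok, which is my substitution, and the sample pairs are hypothetical.

```python
from collections import Counter

import numpy as np
from scipy.sparse import coo_matrix

counts = Counter()
for i, j in [(0, 1), (0, 1), (2, 3)]:  # hypothetical input pairs
    counts[(i, j)] += 1
    if i != j:
        counts[(j, i)] += 1  # keep the counts symmetric

# Unpack the dict into the three arrays that coo_matrix expects.
keys = np.array(list(counts.keys()), dtype=np.int64)  # shape (nnz, 2)
vals = np.fromiter(counts.values(), dtype=np.int32, count=len(counts))
n = int(keys.max()) + 1
C = coo_matrix((vals, (keys[:, 0], keys[:, 1])), shape=(n, n)).tocsr()
```

The dictionary holds one entry per distinct pair, so its memory use tracks the number of nonzeros rather than the number of input lines.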

dok and lil are the best sparse formats for incremental construction, but neither is very good compared to plain Python list and dict methods.

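As for the top N values of each row, I recall exploring that, but it was a while ago, so I can't pull up a good SO question offhand. You probably want a row-oriented format such as lil or csr.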
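One possible way to do this with csr, sketched under the assumption that only explicitly stored (nonzero) entries matter and that n is at least 1. The helper name `top_n_per_row` is hypothetical. It slices each row out of the `indptr`, `indices` and `data` arrays and uses `np.argpartition` to avoid fully sorting long rows.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_per_row(M, n):
    """Return, for each row, a list of (col, value) pairs for its n largest stored values."""
    M = csr_matrix(M)
    result = []
    for r in range(M.shape[0]):
        start, end = M.indptr[r], M.indptr[r + 1]
        cols = M.indices[start:end]
        vals = M.data[start:end]
        if len(vals) > n:
            # Select the n largest entries without sorting the whole row.
            keep = np.argpartition(vals, -n)[-n:]
            cols, vals = cols[keep], vals[keep]
        # Sort the selected entries in descending order of value.
        order = np.argsort(vals)[::-1]
        result.append(list(zip(cols[order].tolist(), vals[order].tolist())))
    return result

A = csr_matrix(np.array([[0, 5, 1, 3],
                         [2, 0, 0, 0],
                         [0, 0, 0, 0]]))
top2 = top_n_per_row(A, 2)
```

For a symmetric matrix the same result applies to columns, so there is no need for a separate csc pass.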

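As for the question 'do you need to specify the size at creation': yes. But because a sparse matrix, regardless of format, stores only the nonzero values, there is little harm in creating a matrix that is too large.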

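I can't think of anything in a dok or coo format matrix that hinges on the shape, at least not in terms of data storage or creation. lil and csr will have some extra values that scale with the number of rows. If you really need to explore this, read up on how the values are stored, and experiment with small matrices.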
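A small experiment along those lines, with made-up data: the same three entries are stored under a tight shape and under a 1M by 1M shape. The coo storage is identical in both cases, while the csr row-pointer array (`indptr`) grows with the number of rows.

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
data = np.array([4, 5, 6], dtype=np.int32)

small = coo_matrix((data, (rows, cols)), shape=(3, 3))
large = coo_matrix((data, (rows, cols)), shape=(1_000_000, 1_000_000))

# coo keeps only the three (row, col, value) triples in both cases.
coo_sizes = (small.nnz, large.nnz)

# csr adds an indptr array with one entry per row, plus one.
indptr_sizes = (len(small.tocsr().indptr), len(large.tocsr().indptr))
```

So the oversized csr matrix costs about a million extra integers, which is modest, but it is not zero.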

==================

It looks like all the code for the dok format is in Python, in

/usr/lib/python3/dist-packages/scipy/sparse/dok.py

Scanning that file, I see that dok does have a resize method:

d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.

Any non-zero elements that lie outside the new shape are removed.
File:      /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type:      method

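So if you want to initialize the matrix to 1M x 1M and later resize it to 100 x 100, you can do so. It will step through all the keys to make sure none lie outside the new range, so it isn't cheap, even though the main action is just changing the shape parameter.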

    newM, newN = shape
    M, N = self.shape
    if newM < M or newN < N:
        # Remove all elements outside new dimensions
        for (i, j) in list(self.keys()):
            if i >= newM or j >= newN:
                del self[i, j]
    self._shape = shape

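If you are sure there aren't any keys outside the new range, you can change the shape directly. In the SciPy version examined here, the other sparse formats don't have a resize method.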

In [31]: d=sparse.dok_matrix((10,10),int)

In [32]: d
Out[32]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>

In [33]: d.resize((5,5))

In [34]: d
Out[34]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>

In [35]: d._shape=(9,9)

In [36]: d
Out[36]: 
<9x9 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Dictionary Of Keys format>
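Tying this back to the original question, here is a sketch of declaring the upper bound first and trimming afterwards. The sizes and entries are made up, and finding the used extent through `tocoo()` is my own choice rather than anything the dok code prescribes.

```python
import numpy as np
from scipy.sparse import dok_matrix

# Start with a generous upper bound on the size.
d = dok_matrix((1000, 1000), dtype=np.int32)
d[2, 7] = 3
d[7, 2] = 3

# Find the largest index actually used, then shrink to fit.
coo = d.tocoo()
n = int(max(coo.row.max(), coo.col.max())) + 1
d.resize((n, n))
```

Since every key is known to lie inside the new shape, resize here removes nothing and only updates the shape.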

See also:

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

Get top-n items of every row in a scipy sparse matrix

How to incrementally create a sparse matrix in Python?
