Question

这是FEM / FVM方程系统的典型用例，因此可能具有更广泛的意义。从三角形网格

我想创建一个scipy.sparse.csr_matrix。矩阵行/列表示网格节点处的值。矩阵在主对角线上有条目，两个节点通过边连接。

这是一个MWE，它首先构建一个node-＆gt; edge-＆gt;单元格关系，然后构建矩阵：

import numpy
import meshzoo
from scipy import sparse

nx = 1600
ny = 1000
verts, cells = meshzoo.rectangle(0.0, 1.61, 0.0, 1.0, nx, ny)

n = len(verts)

nds = cells.T
nodes_edge_cells = numpy.stack([nds[[1, 2]], nds[[2, 0]],nds[[0, 1]]], axis=1)

# assign values to each edge (per cell)
alpha = numpy.random.rand(3, len(cells))
vals = numpy.array([
    [alpha**2, -alpha],
    [-alpha, alpha**2],
    ])

# Build I, J, V entries for COO matrix
I = []
J = []
V = []
#
V.append(vals[0][0])
V.append(vals[0][1])
V.append(vals[1][0])
V.append(vals[1][1])
#
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[1])
I.append(nodes_edge_cells[1])
#
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
# Create suitable data for coo_matrix
I = numpy.concatenate(I).flat
J = numpy.concatenate(J).flat
V = numpy.concatenate(V).flat

matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()

用

python -m cProfile -o profile.prof main.py
snakeviz profile.prof

可以创建和查看上述的个人资料：

方法tocsr()占据了运行时的大部分份额，但在构建alpha更复杂时也是如此。因此，我正在寻找提高速度的方法。

我已经找到了：

由于数据结构的原因，矩阵对角线上的值可以预先加总，即

V.append(vals[0, 0, 0] + vals[1, 1, 2])
I.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]
J.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]

这会缩短I，J，V，从而加快tocsr。

现在，边缘是“每个细胞”。我可以使用numpy.unique识别彼此相等的边缘，有效地节省了约I，J，V的一半。但是，我发现这也需要一些时间。（这并不奇怪。）

另外一个想法是，如果有{{1}，我可以用简单的V替换对角线I，J，numpy.add.at类似于主对角线分开保存的数据结构。我知道这存在于其他一些软件包中，但在scipy中找不到它。正确的吗？

或许有一种直接构建CSR的明智方法？

Answer 1

我会尝试直接创建csr结构，特别是如果你使用np.unique，因为这会给你排序的键，这是完成工作的一半。

我假设您已经i, j按字典顺序排序，v使用np.add.at对inverse的{{1}}输出进行求和。 1}}。

然后np.unique和v已经采用csr格式。剩下要做的就是创建j，只需indptr np.searchsorted(i, np.arange(M+1))，其中M是列长。您可以将这些直接传递给sparse.csr_matrix构造函数。

好的，让代码说：

import numpy as np
from scipy import sparse
from timeit import timeit


def tocsr(I, J, E, N):
    n = len(I)
    K = np.empty((n,), dtype=np.int64)
    K.view(np.int32).reshape(n, 2).T[...] = J, I  
    S = np.argsort(K)
    KS = K[S]
    steps = np.flatnonzero(np.r_[1, np.diff(KS)])
    ED = np.add.reduceat(E[S], steps)
    JD, ID = KS[steps].view(np.int32).reshape(-1, 2).T
    ID = np.searchsorted(ID, np.arange(N+1))
    return sparse.csr_matrix((ED, np.array(JD, dtype=int), ID), (N, N))

def viacoo(I, J, E, N):
    return sparse.coo_matrix((E, (I, J)), (N, N)).tocsr()


#testing and timing

# correctness
N = 1000
A = np.random.random((N, N)) < 0.001
I, J = np.where(A)

E = np.random.random((2, len(I)))
D = np.zeros((2,) + A.shape)
D[:, I, J] = E
D2 = tocsr(np.r_[I, I], np.r_[J, J], E.ravel(), N).A

print('correct:', np.allclose(D.sum(axis=0), D2))

# speed
N = 100000
K = 10

I, J = np.random.randint(0, N, (2, K*N))
E = np.random.random((2 * len(I),))
I, J, E = np.r_[I, I, J, J], np.r_[J, J, I, I], np.r_[E, E]

print('N:', N, ' --  nnz (with duplicates):', len(E))
print('direct: ', timeit('f(a,b,c,d)', number=10, globals={'f': tocsr, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')
print('via coo:', timeit('f(a,b,c,d)', number=10, globals={'f': viacoo, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')

打印：

correct: True
N: 100000  --  nnz (with duplicates): 4000000
direct:  7.702431229001377 secs for 10 iterations
via coo: 41.813509466010146 secs for 10 iterations

加速：5x

Answer 2

所以，最终结果是COO和CSR的sum_duplicates之间的区别（就像@hpaulj怀疑的那样）。感谢所有参与者（特别是@ paul-panzer）的努力，a PR正在进行中，以tocsr获得极大的加速。

SciPy的tocsr在lexsort上执行了(I, J)，因此它有助于以(I, J)已经公平排序的方式组织索引。

对于上述示例中的nx=4，ny=2，I和J是

[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]
[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]

首先排序cells的每一行，然后排序第一列的行，如

cells = numpy.sort(cells, axis=1)
cells = cells[cells[:, 0].argsort()]

产生

[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]
[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]

对于原始帖子中的数字，排序会将运行时间从大约40秒减少到8秒。

如果节点首先编号更合适，也许可以实现更好的排序。我正在考虑Cuthill-McKee和friends。

有效地构建FEM / FVM矩阵

2 个答案: