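I am trying to operate on a large sparse matrix (currently 12000 x 12000). What I want to do is set blocks of it to zero while keeping the largest value inside each block. I already have a working solution for dense matrices: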
import numpy as np
from scipy.sparse import random
np.set_printoptions(precision=2)

#x = random(10,10,density=0.5)
x = np.random.random((10,10))
x = x.T * x
print(x)

def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.max(sub)
    sub[sub < z] = 0

sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)

for i in range(1,len(sizes)):
    current_i_min = sizes_sum[i-1]
    current_i_max = sizes_sum[i]
    for j in range(1,len(sizes)):
        if i >= j:
            continue
        current_j_min = sizes_sum[j-1]
        current_j_max = sizes_sum[j]
        keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
        keep_only_max(current_j_min, current_j_max, current_i_min, current_i_max)

print(x)
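However, this does not work for sparse matrices (try uncommenting the line at the top). Any ideas how I could implement this efficiently without calling todense()?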
Answer 0 (score: 1)
def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.max(sub)
    sub[sub < z] = 0
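For a sparse `x`, the `sub` slicing works in `csr` format. It won't be as fast as the equivalent dense slice, and, unlike the dense slice, it creates a copy of that part of `x` rather than a view. That is why zeroing entries of `sub` leaves `x` unchanged.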
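To make the copy behaviour concrete, here is a minimal sketch. The small 3 x 3 matrix is made up for illustration and only stands in for the real `x`:

```python
import numpy as np
from scipy import sparse

# Small CSR matrix standing in for the large sparse x (values are illustrative).
x = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 0.0],
                                [4.0, 0.0, 5.0]]))

sub = x[0:2, 0:2]   # slicing a CSR matrix builds a new matrix, not a view
sub.data[:] = 0.0   # overwrite the stored values of the slice only

print(x.toarray())  # x still holds its original values
```

So any sparse version of `keep_only_max` has to return a new block, or write the result back into `x` explicitly.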
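I'd have to check the sparse `max` functions. But I can imagine converting `sub` to `coo` format, using `np.argmax` on its `.data` attribute, and, with the corresponding `row` and `col` values, constructing a new matrix of the same shape but with just one nonzero value.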
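A rough sketch of that idea follows. The helper name `keep_only_max_coo` is hypothetical, and the sketch assumes the block maximum is one of the stored (nonzero) entries, which holds for nonnegative data. Unlike the dense `sub[sub < z] = 0`, it keeps only the first entry when the maximum is tied:

```python
import numpy as np
from scipy import sparse

def keep_only_max_coo(sub):
    """Return a matrix shaped like `sub` that holds only its largest stored value.

    Hypothetical helper: assumes the maximum is among the stored entries.
    """
    coo = sub.tocoo()
    if coo.nnz == 0:
        # Nothing stored in this block, so return an empty block of the same shape.
        return sparse.coo_matrix(sub.shape)
    k = np.argmax(coo.data)  # position of the largest stored value
    return sparse.coo_matrix(
        ([coo.data[k]], ([coo.row[k]], [coo.col[k]])), shape=sub.shape)

block = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                    [7.0, 0.0, 1.0]]))
print(keep_only_max_coo(block).toarray())
```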
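If your blocks cover `x` in a regular, non-overlapping manner, I'd suggest constructing a new matrix with `sparse.bmat`. That basically collects the `coo` attributes of all the components, joins them into one set of arrays with the appropriate offsets, and makes a new `coo` matrix.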
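As a small illustration of how `sparse.bmat` places blocks, here is a sketch with three made-up blocks `a`, `b` and `c` (not taken from the question):

```python
import numpy as np
from scipy import sparse

a = sparse.coo_matrix(np.array([[1.0, 2.0],
                                [3.0, 4.0]]))              # 2x2 diagonal block
b = sparse.coo_matrix(np.array([[5.0]]))                   # 1x1 diagonal block
c = sparse.coo_matrix(([9.0], ([1], [0])), shape=(2, 1))   # 2x1 off-diagonal block

# Blocks in the same block-row must share a height,
# blocks in the same block-column must share a width.
z = sparse.bmat([[a, c],
                 [c.T, b]])

print(z.toarray())
```

Each block keeps its own row and column indices; `bmat` shifts them by the sizes of the blocks above and to the left.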
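If the blocks are scattered or overlap, you might have to generate them and then insert them back into `x` one by one. `csr` format should work for this, but it will issue a sparse efficiency warning. `lil` is supposed to be faster for changing values. I think it will accept blocks.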
I can imagine doing this with sparse matrices, but it will take time to set up a test case and debug the process.
Answer 1 (score: 0)
Thanks to hpaulj, I managed to implement a solution using scipy.sparse.bmat:
from scipy.sparse import csr_matrix
from scipy.sparse import rand
from scipy.sparse import bmat
import numpy as np
np.set_printoptions(precision=2)

# my matrices are symmetric, so generate a random symmetric matrix
x = rand(10,10,density=0.4)
x = x.T * x

def keep_only_max(a,b,c,d):
    # build a new block of the same shape that holds only the maximum of x[a:b,c:d]
    sub = x[a:b,c:d]
    i1, j1 = np.unravel_index(sub.argmax(),sub.shape)
    return csr_matrix(([sub[i1,j1]],([i1],[j1])),shape=(b-a,d-c))

def keep_all(a,b,c,d):
    return x[a:b,c:d].copy()

# we want to create a chessboard pattern where the first central block is 1x1, the second 5x5 and the last 4x4
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)

# 2D list to store our chessboard blocks (loop variables named _ so they cannot clash with x)
r = range(len(sizes)-1)
blocks = [[None for _ in r] for _ in r]

for i in range(1,len(sizes)):
    current_i_min = sizes_sum[i-1]
    current_i_max = sizes_sum[i]
    for j in range(i,len(sizes)):
        current_j_min = sizes_sum[j-1]
        current_j_max = sizes_sum[j]
        if i == j:
            # keep the blocks on the diagonal completely
            blocks[i-1][j-1] = keep_all(current_i_min, current_i_max, current_j_min, current_j_max)
        else:
            # the blocks not on the diagonal only keep their maximum value;
            # we can leverage the matrix symmetry and only calculate one new matrix.
            m1 = keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
            blocks[i-1][j-1] = m1
            blocks[j-1][i-1] = m1.T

z = bmat(blocks)
print(z.todense())