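I have a compressed sparse row (CSR) matrix containing counts. I want to build a matrix containing the expected frequencies for these counts. This is the code I am currently using: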
from scipy.sparse import coo_matrix
#m is a csr_matrix
col_total = m.sum(axis=0)
row_total = m.sum(axis=1)
n = int(col_total.sum(axis=1))
A = coo_matrix(m)
for i,j in zip(A.row,A.col):
    m[i,j]= col_total.item(j)*row_total.item(i)/n
This works fine for small matrices. On a larger matrix (> 1 GB), the for loop takes days to run. Is there any way to make it faster?
Answer 0 (score: 2)
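m.data = (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]
is a fully vectorized way of computing m.data. It could probably be cleaned up a bit. col_total is a matrix, so performing the multiplication element by element requires some extra syntax.
Let me demonstrate: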
In [37]: m=sparse.rand(10,10,.1,'csr')
In [38]: col_total=m.sum(axis=0)
In [39]: row_total=m.sum(axis=1)
In [40]: n=int(col_total.sum(axis=1))
In [42]: A=m.tocoo()
In [46]: for i,j in zip(A.row,A.col):
   ....:     m[i,j]= col_total.item(j)*row_total.item(i)/n
   ....:
In [49]: m.data
Out[49]:
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 ,
0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632])
In [51]: col_total[:,A.col].A*(row_total[A.row,:].T.A)/n
Out[51]:
array([[ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 ,
0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632]])
In [53]: (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]
Out[53]:
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 ,
0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632])
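One possible cleanup of the one-liner above, offered as a minimal sketch rather than the answerer's own code: flatten the totals to plain 1-D arrays and read the row and column index of each stored value directly from the CSR structure (indptr and indices), so the result is guaranteed to line up with m.data. The function name expected_freqs_inplace and the small counts matrix are hypothetical, and the grand total is left as a plain sum instead of being cast to int.

```python
import numpy as np
from scipy import sparse

def expected_freqs_inplace(m):
    """Return a CSR copy of m whose stored values are the expected frequencies."""
    out = m.tocsr().copy()  # work on a copy so the caller's matrix is untouched
    row_total = np.asarray(out.sum(axis=1)).ravel()  # 1-D row sums
    col_total = np.asarray(out.sum(axis=0)).ravel()  # 1-D column sums
    n = row_total.sum()  # grand total (assumption: not cast to int here)
    # Row index of every stored value: row i repeats once per entry stored in row i.
    rows = np.repeat(np.arange(out.shape[0]), np.diff(out.indptr))
    # out.indices already holds the column index of every stored value.
    out.data = row_total[rows] * col_total[out.indices] / n
    return out

# Hypothetical 3x3 count matrix used only for illustration
counts = sparse.csr_matrix(np.array([[2, 0, 1],
                                     [0, 3, 0],
                                     [4, 0, 0]]))
result = expected_freqs_inplace(counts)
print(result.toarray())
```

Because only the data array is replaced, the sparsity pattern of the input is preserved exactly, including any explicitly stored zeros.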
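Answer 1 (score: 1)
To expand a little on @hpaulj's answer, you can get rid of the for loop entirely by creating the output matrix directly from the expected frequencies and the row/column indices of the non-zero elements in m: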
from scipy import sparse
import numpy as np

def fast_efreqs(m):
    col_total = np.array(m.sum(axis=0)).ravel()
    row_total = np.array(m.sum(axis=1)).ravel()
    # I'm casting this to an int for consistency with your version, but it's
    # not clear to me why you would want to do this...
    grand_total = int(col_total.sum())
    ridx, cidx = m.nonzero()  # indices of non-zero elements in m
    efreqs = row_total[ridx] * col_total[cidx] / grand_total
    # pass shape explicitly so trailing all-zero rows/columns are not dropped
    return sparse.coo_matrix((efreqs, (ridx, cidx)), shape=m.shape)
For comparison, here is the original code wrapped in a function:
def orig_efreqs(m):
    col_total = m.sum(axis=0)
    row_total = m.sum(axis=1)
    n = int(col_total.sum(axis=1))
    A = sparse.coo_matrix(m)
    for i,j in zip(A.row,A.col):
        m[i,j]= col_total.item(j)*row_total.item(i)/n
    return m
Testing for equivalence on a small matrix:
m = sparse.rand(100, 100, density=0.1, format='csr')
print((orig_efreqs(m.copy()) != fast_efreqs(m)).nnz == 0)
# True
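As a further sanity check on a matrix small enough to densify, the sparse result can be compared with the textbook formula, the outer product of the row and column totals divided by the grand total, masked to the non-zero positions of the input. This is only an illustrative sketch: the counts matrix and the variable names are hypothetical, and the grand total is not cast to int.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 count matrix used only for illustration
counts = sparse.csr_matrix(np.array([[2, 0, 1],
                                     [0, 3, 0],
                                     [4, 0, 0]]))
row_total = np.asarray(counts.sum(axis=1)).ravel()
col_total = np.asarray(counts.sum(axis=0)).ravel()
n = row_total.sum()

# Dense reference: E[i, j] = row_total[i] * col_total[j] / n,
# kept only where the original matrix has a non-zero count.
dense_ref = np.outer(row_total, col_total) / n * (counts.toarray() != 0)

# Sparse version built from the indices of the non-zero elements
ridx, cidx = counts.nonzero()
sparse_version = sparse.coo_matrix(
    (row_total[ridx] * col_total[cidx] / n, (ridx, cidx)),
    shape=counts.shape)

agree = np.allclose(sparse_version.toarray(), dense_ref)
print(agree)
```

The dense reference needs memory proportional to rows times columns, so it is only practical for small test inputs, not for the multi-gigabyte matrix in the question.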
Benchmarking performance on a larger matrix:
In [1]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
.....: orig_efreqs(m)
.....:
1 loops, best of 3: 2min 25s per loop
In [2]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
.....: fast_efreqs(m)
.....:
10 loops, best of 3: 38.3 ms per loop