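I want to iterate over the rows of a CSR matrix and divide each element by the sum of its row, similar to the approach in a related post.
My problem is that I am working with a large matrix: (96582, 350138)
When I apply the operation from the linked post, it blows up my memory, because the returned matrix is dense.
So this was my first attempt: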
for row in counts:
    # rebinds the local name `row` only; `counts` is left unchanged
    row = row / row.sum()
Unfortunately, this does not affect the matrix at all, so I came up with a second idea: create a new CSR matrix and concatenate the rows using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
# start from an empty matrix and append one normalized row per iteration
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works, but after some iterations it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
Any suggestions? Any idea why vstack gets slower and slower?
Answer 0 (score: 5)
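vstack is an O(n) operation, because it needs to allocate memory for the result and then copy the contents of all the arrays you pass as arguments into the result array. Since your loop calls it once per row on an ever-growing matrix, the total cost grows quadratically with the number of rows.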
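If you do want to keep the vstack idea, one way to avoid the repeated copying is to collect the rows in a Python list and stack them a single time at the end. The following is only a minimal sketch on a tiny example matrix (the 3x3 `counts` here is an assumption for illustration, not the large matrix from the question):

```python
import numpy as np
from scipy import sparse

# small illustrative matrix; stands in for the real `counts`
counts = sparse.csr_matrix(np.array([[1., 0., 2.],
                                     [0., 0., 3.],
                                     [4., 5., 6.]]))

rows = []
for i in range(counts.shape[0]):
    row = counts[i, :]              # 1 x n sparse row
    rows.append(row / row.sum())    # sparse / scalar stays sparse

# a single vstack call copies every row exactly once
mtx = sparse.vstack(rows, format="csr")
print(mtx.toarray())
```

This still loops in Python and divides by zero for an all-zero row, so it is mainly useful to show where the slowdown comes from.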
You can do the operation simply by using multiply:
>>> res = counts.multiply(1 / counts.sum(1)) # multiply with inverse
>>> res.todense()
matrix([[ 0.33333333, 0. , 0.66666667],
[ 0. , 0. , 1. ],
[ 0.26666667, 0.33333333, 0.4 ]])
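A related variant, swapped in here as an alternative and not part of the original answer, is to multiply by a sparse diagonal matrix of inverse row sums. The sketch below assumes a float matrix; the helper name `normalize_rows` is made up for illustration, and rows that sum to zero are simply left as they are:

```python
import numpy as np
from scipy import sparse

def normalize_rows(mat):
    """Return a new CSR matrix whose rows are divided by their sums."""
    row_sums = np.asarray(mat.sum(axis=1), dtype=float).ravel()
    inv = np.zeros_like(row_sums)
    nonzero = row_sums != 0
    inv[nonzero] = 1.0 / row_sums[nonzero]   # leave zero-sum rows untouched
    # diag(inv) @ mat scales row i by inv[i]; the product stays sparse
    return (sparse.diags(inv) @ mat).tocsr()

counts = sparse.csr_matrix(np.array([[1., 0., 2.],
                                     [0., 0., 3.],
                                     [4., 5., 6.]]))
res = normalize_rows(counts)
print(res.toarray())
```

Because both factors are sparse, no dense intermediate of the full matrix shape is created.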
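But it is also easy to do what you want (relatively efficiently) with np.lib.stride_tricks.as_strided. The as_strided function also lets you perform more complicated operations on the array, for cases where no ready-made method or function exists.
For example, using the example CSR matrix from the scipy documentation: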
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0,0,1,2,2,2])
>>> col = np.array([0,2,2,0,1,2])
>>> data = np.array([1.,2,3,4,5,6])
>>> counts = csr_matrix( (data,(row,col)), shape=(3,3) )
>>> counts.todense()
matrix([[ 1., 0., 2.],
[ 0., 0., 3.],
[ 4., 5., 6.]])
You can divide each row by its sum like this:
>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]
...     row /= row.sum()
>>> counts.todense()
matrix([[ 0.33333333, 0. , 0.66666667],
[ 0. , 0. , 1. ],
[ 0.26666667, 0.33333333, 0.4 ]])
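The Python loop over rows can also be avoided entirely. The sketch below is an alternative to the as_strided loop, not the method from this answer: it expands the row sums to one value per stored element with np.repeat and divides counts.data in place. It assumes a float dtype, and the helper name `normalize_rows_inplace` is made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def normalize_rows_inplace(mat):
    """Divide every stored value of a float CSR matrix by its row sum."""
    row_sums = np.asarray(mat.sum(axis=1), dtype=float).ravel()
    row_sums[row_sums == 0] = 1.0            # avoid 0/0 for zero-sum rows
    nnz_per_row = np.diff(mat.indptr)        # stored entries in each row
    # one divisor per stored element, in the same order as mat.data
    mat.data /= np.repeat(row_sums, nnz_per_row)
    return mat

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2, 3, 4, 5, 6])
counts = csr_matrix((data, (row, col)), shape=(3, 3))
normalize_rows_inplace(counts)
print(counts.toarray())
```

Only counts.data is touched, so the sparsity structure (indices and indptr) stays exactly as it was.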
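Answer 1 (score: 2)
@MSeifert's answer is much more efficient, and it should be the proper way to do this. I think that writing counts[i, :] implies some column slicing, which I had not realized. The documentation explicitly says that those are very inefficient operations for a csr_matrix. His approach is indeed a good example.
The documentation says that row slicing is efficient, so I think you should do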
for i in range(counts.shape[0]):
    counts[i, :] /= counts[i, :].sum()
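This way you edit your matrix in place, it stays sparse, and you don't have to use vstack. I am not sure it is the most efficient operation, but at least you should not run into memory problems, and there is no slowdown as more rows are processed: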
import time

s = time.time()
for i in range(counts.shape[0]):
    # note: this timing run adds 1 to the row sum
    counts[i, :] /= (counts[i, :].sum() + 1)
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)  # seconds spent on the last 1000 rows
        s = time.time()
1000 6.00199794769
2000 6.02894091606
3000 7.44459486008
4000 7.10011601448
5000 6.16998195648
6000 7.79510307312
7000 7.00139117241
8000 7.37821507454
9000 7.28075814247
...
# same timing loop for the as_strided approach from @MSeifert's answer
row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
                                                 shape=(counts.shape[0], 2),
                                                 strides=2*counts.indptr.strides)
for i, (start, stop) in enumerate(row_start_stop):
    row = counts.data[start:stop]
    row /= row.sum()
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)  # seconds spent on the last 1000 rows
        s = time.time()
1000 0.00735783576965
2000 0.0108380317688
3000 0.0102109909058
4000 0.0131571292877
5000 0.00670218467712
6000 0.00608897209167
7000 0.00663685798645
8000 0.0164499282837
9000 0.0061981678009
...
As for why using vstack is slow, @MSeifert's answer is great.
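To reproduce the comparison on your own machine, something along these lines could be used. It is only a rough sketch on a small random matrix (the shape, density and function names are assumptions for illustration). It pairs indptr[:-1] with indptr[1:] in place of as_strided, which yields the same start/stop pairs, and the timings will of course differ from the numbers above:

```python
import time
from scipy import sparse

def normalize_by_row_slicing(mat):
    # row slicing: every counts[i, :] builds a new 1 x n sparse matrix
    for i in range(mat.shape[0]):
        total = mat[i, :].sum()
        if total != 0:
            mat[i, :] = mat[i, :] / total
    return mat

def normalize_by_data_slices(mat):
    # operate directly on the slice of mat.data that belongs to each row
    for start, stop in zip(mat.indptr[:-1], mat.indptr[1:]):
        total = mat.data[start:stop].sum()
        if total != 0:
            mat.data[start:stop] /= total
    return mat

# small random test matrix with positive float entries
base = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

results = {}
for func in (normalize_by_row_slicing, normalize_by_data_slices):
    mat = base.copy()
    t0 = time.perf_counter()
    results[func.__name__] = func(mat)
    print(func.__name__, time.perf_counter() - t0)
```

Both functions should produce the same matrix; only the time they take differs.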