I am trying to build a pointwise mutual information matrix. I have a 60k-by-60k scipy sparse matrix of word co-occurrence counts, and I want to turn it into another sparse matrix in which entry (i, j) corresponds to log(p(i,j) / (p(i) * p(j))) for words i and j. I then drop the negative values to obtain the PPMI matrix. I am looking for an efficient way to iterate over the first matrix and generate the second one without using too much memory.
I have tried working on a copy of the first matrix and iterating over it in place, and also building a new CSR matrix row by row, using vstack on the two sparse matrices to append each new row. Both processes were killed by memory errors. What is the best way to build this matrix, and then save it so it can be reused later?
import time
import numpy as np
from scipy.sparse import vstack
from scipy import sparse

if inplace:
    for i in range(ctxt_matrix.shape[0]):  # row-wise operation
        # For each row (word vector), reweigh it in 3 steps:
        # 1. get the probability of this context, instead of the raw count (divide by total words)
        # 2. divide this probability by the probability of this row/context occurring together randomly
        #    (for this word and all the other words, do element-wise division)
        # 3. take the log of this quotient, and reassign the row to it.
        row_pmi = np.log(np.divide(ctxt_matrix[i].toarray().T / total_words,
                                   word_probas * word_probas[i])).T
        if cutoff_0:
            row_pmi[row_pmi < 0] = 0  # 0 cutoff
        ctxt_matrix[i, :] = row_pmi
    print('PMI matrix building took:', time.time() - start)
    return ctxt_matrix
else:
    # same as above, but building a new matrix row by row, using vstack.
    pmi_matrix = sparse.csr_matrix((1, ctxt_matrix.shape[1]))
    for i in range(ctxt_matrix.shape[0]):  # row-wise operation
        row_pmi = sparse.csr_matrix(np.log(np.divide(ctxt_matrix[i].toarray().T / total_words,
                                                     word_probas * word_probas[i])).T)
        if cutoff_0:
            row_pmi[row_pmi < 0] = 0  # 0 cutoff
        pmi_matrix = vstack((pmi_matrix, row_pmi))
        del row_pmi
    print('PMI matrix building took:', time.time() - start)
    return pmi_matrix
TL;DR - I need to do a row-wise operation, creating one sparse matrix by iterating over another. Here is some simplified code showing what I am doing:
from scipy import sparse
import numpy as np
import time

start = time.time()
ctxt_matrix = sparse.csr_matrix(sparse.rand(5000, 5000))
for i in range(ctxt_matrix.shape[0]):
    row_pmi = np.log(ctxt_matrix[i, :].toarray().T / 500)  # some row-wise operation on the other matrix
    row_pmi[row_pmi < 0] = 0  # don't store negatives in memory
    ctxt_matrix[i, :] = sparse.csr_matrix(row_pmi).T
    ctxt_matrix[i, :].eliminate_zeros()
print('PMI matrix building took:', time.time() - start)
Answer 0 (score: 0)
I tried a few variations on your code:
import numpy as np
from scipy.sparse import vstack
from scipy import sparse

n, m = 10, 50000
source = sparse.random(n, m, 0.2, format='csr') * 5000
print(repr(source))

ctxt_matrix = source.copy()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i, :].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i, :].toarray().T / 500)  # some row-wise operation on the other matrix
    row_pmi[row_pmi < 0] = 0  # don't store negatives in memory
    temp = sparse.csr_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i, :] = temp
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nrow lil')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i, :].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i, :].toarray().T / 500)  # some row-wise operation on the other matrix
    row_pmi[row_pmi < 0] = 0  # don't store negatives in memory
    temp = sparse.lil_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i, :] = temp
print(repr(ctxt_matrix))

print('\nrow lil data')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    data = np.array(ctxt_matrix.data[i])
    print(len(data))
    data = np.log(data / 500)  # some row-wise operation on the other matrix
    data[data < 0] = 0  # don't store negatives in memory
    ctxt_matrix.data[i][:] = data
#print(repr(ctxt_matrix))
ctxt_matrix = ctxt_matrix.tocsr()
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nwhole csr data')
ctxt_matrix = source.copy()
data = ctxt_matrix.data
data = np.log(data / 500)
data[data < 0] = 0
ctxt_matrix.data[:] = data
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))
Results:
1407:~/mypy$ python3 stack47615473.py
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
with 100000 stored elements in Compressed Sparse Row format>
stack47615473.py:12: RuntimeWarning: divide by zero encountered in log
row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
with 90051 stored elements in Compressed Sparse Row format>
row lil
stack47615473.py:24: RuntimeWarning: divide by zero encountered in log
row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
with 90051 stored elements in LInked List format>
row lil data
10069
9931
10159
10069
9940
9961
9941
9935
9943
10052
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
with 90051 stored elements in Compressed Sparse Row format>
whole csr data
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
with 90051 stored elements in Compressed Sparse Row format>
The lil row iteration is slower than the csr one. The lil and csr data manipulations are nearly instantaneous.
There is also a way to iterate over the data of the csr format directly. It requires indexing it with values taken from the indptr attribute. That has been discussed in previous SO questions (possibly ones you have asked).
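As a rough sketch of that approach (reusing the placeholder log(x/500) row operation from the code above, not your exact PMI math), each row's nonzero values can be updated in place through indptr and data:

import numpy as np
from scipy import sparse

ctxt_matrix = sparse.random(10, 50000, 0.2, format='csr') * 5000
for i in range(ctxt_matrix.shape[0]):
    # slice out the nonzero values of row i via indptr
    start_i, stop_i = ctxt_matrix.indptr[i], ctxt_matrix.indptr[i + 1]
    row_data = ctxt_matrix.data[start_i:stop_i]
    row_data = np.log(row_data / 500)    # same placeholder row-wise operation
    row_data[row_data < 0] = 0           # 0 cutoff
    ctxt_matrix.data[start_i:stop_i] = row_data
ctxt_matrix.eliminate_zeros()            # drop the entries that became 0

No per-row csr or lil matrices are constructed here, only 1-D views of the data array.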
The csr row iteration is somewhat slow because it has to construct a new csr matrix on each pass, and the toarray step is also somewhat slow. If you can operate on just the nonzero data values of a row or of the whole matrix, it is much faster.
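Applied to the actual reweighting, a minimal sketch along those lines could work on the whole data array at once, using indptr to recover the row index of every stored entry. This assumes, as in your question, that word_probas is a 1-D array of unigram probabilities and total_words is the total token count; ppmi_inplace is just an illustrative name:

import numpy as np

def ppmi_inplace(ctxt_matrix, word_probas, total_words):
    # row and column index of every stored (nonzero) co-occurrence count
    row_idx = np.repeat(np.arange(ctxt_matrix.shape[0]),
                        np.diff(ctxt_matrix.indptr))
    col_idx = ctxt_matrix.indices
    joint = ctxt_matrix.data / total_words                    # p(i, j) for stored pairs
    pmi = np.log(joint / (word_probas[row_idx] * word_probas[col_idx]))
    pmi[pmi < 0] = 0                                          # PPMI cutoff
    ctxt_matrix.data[:] = pmi
    ctxt_matrix.eliminate_zeros()
    return ctxt_matrix

Because the entries absent from the sparse matrix are exactly the zero co-occurrence counts, skipping them also avoids the log-of-zero warnings the dense per-row version produces. If your scipy version has them, sparse.save_npz and sparse.load_npz can then store the finished matrix for later reuse.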
This does not address the high memory use. I would expect in-place changes to the matrix to use less memory, while the repeated vstack uses a lot. I do wonder, though: is the matrix so large that just constructing a copy of it produces the memory error?