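I want to compute the cosine similarity between corresponding rows of two huge sparse matrices. The standard functions compute it for every pair of rows; my case is actually "easier", but I haven't found an implementation that does it.
For example, the result would be an array of cos(A[row i], B[row i]) for all i in N.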
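To make the desired output concrete, here is a small dense illustration. The two arrays are made-up toy values rather than my real data, and the loop is only meant to define the result, not to be fast:

```python
import numpy as np

# Toy dense stand-ins for A and B (hypothetical values, not the real data).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
B = np.array([[2.0, 0.0, 4.0],
              [1.0, 0.0, 0.0]])

# One cosine per row index i: dot(A[i], B[i]) / (|A[i]| * |B[i]|)
expected = np.array([
    A[i] @ B[i] / (np.linalg.norm(A[i]) * np.linalg.norm(B[i]))
    for i in range(A.shape[0])
])
print(expected)
```

Row 0 of A and B point in the same direction and row 1 is orthogonal, so the two entries should come out close to 1 and 0 respectively.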
I have already tried using map over each row, but it is really slow. Now I am trying to use matrix operations to compute all rows at once.
import numpy as np
import scipy.sparse as sp
A = sp.csr_matrix((N,N))
B = sp.csr_matrix((N,N))
A_norm = sp.dok_matrix(A.shape)
A_norm[A.nonzero()] = A[A.nonzero()] / A[A.nonzero()].sum(axis=0)
B_norm = sp.dok_matrix(B.shape)
B_norm[B.nonzero()] = B[B.nonzero()] / B[B.nonzero()].sum(axis=0)
AB = (A_norm*B_norm).sum(axis=0)
AA = np.sum(np.sqrt( A_norm*A_norm ), axis=0)
BB = np.sum(np.sqrt( B_norm*B_norm ), axis=0)
AA_BB = (AA*BB)
cos = sp.dok_matrix(AB.shape)
cos[AA_BB.nonzero()] = AB[AA_BB.nonzero()] / AA_BB[AA_BB.nonzero()]
The error occurs when I compute (AA * BB) in the code above. The stack trace is:
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-244-00be56e17413> in <module>()
9 AA = np.sum(np.sqrt( A_norm*A_norm ), axis=0)
10 BB = np.sum(np.sqrt( B_norm*B_norm ), axis=0)
---> 11 AA_BB = (AA*BB)
12
13 cos = sp.dok_matrix(AB.shape)
/usr/local/lib/python3.6/dist-packages/numpy/matrixlib/defmatrix.py in __mul__(self, other)
218 if isinstance(other, (N.ndarray, list, tuple)) :
219 # This promotes 1-D vectors to row vectors
--> 220 return N.dot(self, asmatrix(other))
221 if isscalar(other) or not hasattr(other, '__rmul__') :
222 return N.dot(self, other)
ValueError: shapes (1,70765) and (1,70765) not aligned: 70765 (dim 1) != 1 (dim 0)
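Reading the traceback, the failure comes from numpy/matrixlib/defmatrix.py: np.sum over a sparse matrix with an axis argument returns an np.matrix, and for np.matrix the * operator is matrix multiplication, so a (1,70765) by (1,70765) product cannot be aligned. The same applies to A_norm*A_norm on SciPy sparse matrices, which is a matrix product rather than an element-wise one. Note also that a per-row result needs axis=1, not axis=0. Below is a minimal sketch of the row-wise computation under those observations. The name rowwise_cosine is a hypothetical helper, not a SciPy function, and returning 0 for rows with zero norm is an assumption about the desired behaviour:

```python
import numpy as np
import scipy.sparse as sp


def rowwise_cosine(A, B):
    """Cosine similarity between A[i] and B[i] for every row i.

    Hypothetical helper, not part of SciPy. Rows whose norm is zero
    in either matrix get similarity 0 (an assumption).
    """
    A = sp.csr_matrix(A)
    B = sp.csr_matrix(B)

    # .multiply() is the element-wise product and stays sparse;
    # summing along axis=1 yields one value per row.
    dot = np.asarray(A.multiply(B).sum(axis=1)).ravel()
    norm_a = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
    norm_b = np.sqrt(np.asarray(B.multiply(B).sum(axis=1)).ravel())

    # Plain 1-D ndarrays, so * is element-wise here.
    denom = norm_a * norm_b

    cos = np.zeros_like(dot, dtype=float)
    nonzero = denom > 0
    cos[nonzero] = dot[nonzero] / denom[nonzero]
    return cos
```

Converting the np.matrix results to 1-D arrays with np.asarray(...).ravel() sidesteps the matrix-multiplication semantics of *. If the np.matrix objects are kept as they are, np.multiply(AA, BB) would be the element-wise alternative.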