How do I compute the cosine similarity between two large sparse matrices?

Time: 2019-08-16 17:12:08

Tags: python numpy scipy sparse-matrix cosine-similarity

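I want to compute the cosine similarity between each pair of corresponding rows of two huge sparse matrices. The usual functions compute it for all pairs of rows; my case is even "easier", yet I have not found an implementation that does this.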

For example: the result would be an array of cos(A[row i], B[row i]) for all i in N.
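To make the target quantity described above concrete, here is a minimal dense reference sketch. The function name `rowwise_cosine_dense` is hypothetical, and returning 0 for an all-zero row is an assumed convention, not something the question specifies. It is only meant for checking results on small inputs, not for the huge sparse case.

```python
import numpy as np

def rowwise_cosine_dense(A, B):
    """Return cos(A[i], B[i]) for every row i (dense reference, small inputs only)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Row-wise dot products: sum over columns of the elementwise product.
    dot = np.einsum("ij,ij->i", A, B)
    # Product of the Euclidean norms of the matching rows.
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    # Assumed convention: rows with zero norm get similarity 0.
    out = np.zeros_like(dot)
    np.divide(dot, norms, out=out, where=norms > 0)
    return out
```

With this sketch, rows pointing in the same direction give 1, orthogonal rows give 0, and a pair containing an all-zero row gives 0 under the assumed convention.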

I have already tried using map over each row, but it is really slow. Now I am trying to compute all rows at once with matrix operations.

import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix((N,N))
B = sp.csr_matrix((N,N))

A_norm = sp.dok_matrix(A.shape)
A_norm[A.nonzero()] = A[A.nonzero()] / A[A.nonzero()].sum(axis=0)

B_norm = sp.dok_matrix(B.shape)
B_norm[B.nonzero()] = B[B.nonzero()] / B[B.nonzero()].sum(axis=0)

AB = (A_norm*B_norm).sum(axis=0)

AA = np.sum(np.sqrt( A_norm*A_norm ), axis=0)
BB = np.sum(np.sqrt( B_norm*B_norm ), axis=0)
AA_BB = (AA*BB)

cos = sp.dok_matrix(AB.shape)
cos[AA_BB.nonzero()] = AB[AA_BB.nonzero()] / AA_BB[AA_BB.nonzero()]

An error occurs when I compute (AA * BB) in the code above. The stack trace is:

--------------------------------------------------------------------------
ValueError                               Traceback (most recent call last)
<ipython-input-244-00be56e17413> in <module>()
      9 AA = np.sum(np.sqrt( A_norm*A_norm ), axis=0)
     10 BB = np.sum(np.sqrt( B_norm*B_norm ), axis=0)
---> 11 AA_BB = (AA*BB)
     12 
     13 cos = sp.dok_matrix(AB.shape)

/usr/local/lib/python3.6/dist-packages/numpy/matrixlib/defmatrix.py in __mul__(self, other)
    218         if isinstance(other, (N.ndarray, list, tuple)) :
    219             # This promotes 1-D vectors to row vectors
--> 220             return N.dot(self, asmatrix(other))
    221         if isscalar(other) or not hasattr(other, '__rmul__') :
    222             return N.dot(self, other)

ValueError: shapes (1,70765) and (1,70765) not aligned: 70765 (dim 1) != 1 (dim 0)
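A likely reading of the traceback above: summing a SciPy sparse matrix along an axis returns a `numpy.matrix`, and `*` between two `numpy.matrix` objects is a matrix product, so two (1, N) operands cannot be aligned. The same applies to `*` between SciPy sparse matrices, which is also a matrix product rather than an elementwise one. The sketch below first reproduces the failure on a tiny example and then shows one possible way to get the per-row cosine using only elementwise sparse operations. The function name `rowwise_cosine_sparse` is hypothetical, the sums run over `axis=1` on the assumption that one value per row is wanted, and returning 0 for all-zero rows is an assumed convention.

```python
import numpy as np
import scipy.sparse as sp

# Reproduce the failure: `*` on np.matrix means matrix multiplication.
AA = np.matrix([[1.0, 2.0, 3.0]])
BB = np.matrix([[4.0, 5.0, 6.0]])
try:
    AA * BB  # (1,3) times (1,3) cannot be aligned
except ValueError as exc:
    print("matrix product failed:", exc)
# np.multiply is elementwise regardless of the matrix type.
elementwise = np.multiply(AA, BB)

def rowwise_cosine_sparse(A, B):
    """Return cos(A[i], B[i]) for every row i without densifying A or B."""
    A = sp.csr_matrix(A, dtype=float)
    B = sp.csr_matrix(B, dtype=float)
    # .multiply() is the elementwise product and stays sparse.
    # The row sums are flattened to 1-D ndarrays so that `*` below is elementwise.
    dot = np.asarray(A.multiply(B).sum(axis=1)).ravel()
    a_norm = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
    b_norm = np.sqrt(np.asarray(B.multiply(B).sum(axis=1)).ravel())
    denom = a_norm * b_norm
    # Assumed convention: rows with zero norm get similarity 0.
    out = np.zeros_like(dot)
    np.divide(dot, denom, out=out, where=denom > 0)
    return out
```

The result is a dense 1-D array of length N, which is small compared with the matrices themselves; the intermediate products keep the sparsity pattern of the inputs, so memory use should stay proportional to the number of nonzeros.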

0 answers:

No answers