Question

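I'm trying to compute the cosine similarity of two vectors. The two vectors (call them Ri and Rj) are users' ratings of items i and j, so naturally they are sparse (typically only a few users rate any particular item). The vectors have 50,000 rows, of which only 0.1% are nonzero.

The cosine similarity should involve only the ratings from users who rated both items. For example, if Ri and Rj are two scipy.sparse.csc matrices whose values are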

Ri = [1,2,0,0,3,4] Rj = [0,1,0,3,5,2]

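then the co-rated ratings are

Ri' = [0,2,0,0,3,4] Rj' = [0,1,0,0,5,2]

so the cosine similarity should be

inner(Ri', Rj') / (|Ri'| * |Rj'|)

My question is: is there an efficient (preferably loop-free) way to find the entries at which both matrices have nonzero values? Thanks!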

Answer 1

I'm not sure which matrices you are asking about here, but assuming you have the two original arrays in variables,

Ri = [ 1, 2, 0, 0, 3, 4]; Rj = [ 0, 1, 0, 3, 5, 2]

here is how to build the co-rated vectors and compute the cosine similarity,

import numpy as np
Rip = np.array( [ i if j != 0 else 0 for i,j in zip(Ri,Rj) ] )
Rjp = np.array( [ j if i != 0 else 0 for i,j in zip(Ri,Rj) ] )

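If you don't want to use a for statement explicitly, you can use map,

# in Python 3, map returns an iterator, so materialize it as an array
Rip = np.array( list( map( lambda x,y: 0 if y == 0 else x, Ri, Rj ) ) )
Rjp = np.array( list( map( lambda x,y: 0 if x == 0 else y, Ri, Rj ) ) )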
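
Another loop-free option is a NumPy boolean mask. This is only a minimal sketch, assuming Ri and Rj are dense arrays of equal length; the name both_rated is made up for illustration.

```python
import numpy as np

Ri = np.array([1, 2, 0, 0, 3, 4])
Rj = np.array([0, 1, 0, 3, 5, 2])

# True at positions where both vectors hold a rating (illustrative name)
both_rated = (Ri != 0) & (Rj != 0)

# keep a rating only where the other vector also has one
Rip = np.where(both_rated, Ri, 0)
Rjp = np.where(both_rated, Rj, 0)
```

For the example vectors this gives the same Rip and Rjp as the list comprehensions.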

You can then compute the cosine similarity using these explicit (or dense) representations of Rip and Rjp,

cos_sim = float( np.dot( Rip, Rjp ) ) / np.sqrt( np.dot( Rip,Rip ) * np.dot( Rjp,Rjp ) )
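
If two items have no users in common, the denominator above is zero. A possible way to guard against that is sketched below; the helper name cosine_similarity and the choice to return 0.0 in that case are assumptions, not something the question specifies.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between 1-D arrays u and v (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.sqrt(np.dot(u, u) * np.dot(v, v))
    if denom == 0.0:
        # assumed convention: no co-rated users means similarity 0
        return 0.0
    return float(np.dot(u, v) / denom)

# co-rated vectors from the example above
cos_sim = cosine_similarity([0, 2, 0, 0, 3, 4], [0, 1, 0, 0, 5, 2])
```

For the example vectors the inner product is 25 and the norms are sqrt(29) and sqrt(30), so cos_sim is 25 / sqrt(870).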

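If you don't want to store the full arrays explicitly, you can use scipy.sparse to store the vectors as sparse single-row (or single-column) matrices. Note that if you do this, np.dot will no longer work, and you should use the dot method of the sparse matrices instead.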

from scipy.sparse import csr_matrix

# pass the shape explicitly so it does not depend on where the nonzeros are
n = len( Rip )

# make single column/row sparse matrix reps of Rip
row = np.array( [ i for (i,x) in enumerate(Rip) if x != 0 ] )
col = np.zeros( row.size, dtype=np.int32 )
dat = np.array( [ x for (i,x) in enumerate(Rip) if x != 0 ] )
Rip_col_mat = csr_matrix( (dat,(row,col) ), shape=(n,1) )
Rip_row_mat = csr_matrix( (dat,(col,row) ), shape=(1,n) )

# make single column/row sparse matrix reps of Rjp
row = np.array( [ i for (i,x) in enumerate(Rjp) if x != 0 ] )
col = np.zeros( row.size, dtype=np.int32 )
dat = np.array( [ x for (i,x) in enumerate(Rjp) if x != 0 ] )
Rjp_col_mat = csr_matrix( (dat,(row,col) ), shape=(n,1) )
Rjp_row_mat = csr_matrix( (dat,(col,row) ), shape=(1,n) )

Now, to compute the cosine similarity, we can do

inner = Rip_row_mat.dot( Rjp_col_mat ).data
Rip_m = np.sqrt( Rip_row_mat.dot( Rip_col_mat ).data )
Rjp_m = np.sqrt( Rjp_row_mat.dot( Rjp_col_mat ).data )

cos_sim = inner / ( Rip_m * Rjp_m )
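
If Ri and Rj already are scipy.sparse column matrices, as in the question, the co-rated entries can probably be found without ever building dense arrays, by multiplying each vector element-wise with a 0/1 indicator of the other. The following is a rough sketch under that assumption; indicator and corated_cosine are hypothetical helper names, and returning 0.0 when there is no overlap is an assumed convention.

```python
import numpy as np
from scipy.sparse import csc_matrix

def indicator(m):
    """Sparse matrix with 1 wherever m holds a nonzero value (hypothetical helper)."""
    ind = m.copy()
    ind.eliminate_zeros()              # drop explicitly stored zeros, if any
    ind.data = np.ones_like(ind.data)  # keep the sparsity pattern, set values to 1
    return ind

def corated_cosine(Ri, Rj):
    """Cosine similarity restricted to entries where both vectors are nonzero."""
    Rip = Ri.multiply(indicator(Rj))   # element-wise product keeps co-rated entries only
    Rjp = Rj.multiply(indicator(Ri))
    inner = Rip.multiply(Rjp).sum()
    denom = np.sqrt(Rip.multiply(Rip).sum() * Rjp.multiply(Rjp).sum())
    # assumed convention: no co-rated users means similarity 0
    return float(inner / denom) if denom != 0 else 0.0

# the example vectors from the question as 6x1 sparse columns
Ri = csc_matrix(np.array([[1, 2, 0, 0, 3, 4]]).T)
Rj = csc_matrix(np.array([[0, 1, 0, 3, 5, 2]]).T)
cos_sim = corated_cosine(Ri, Rj)
```

Everything here works on the stored nonzeros only, which should matter for vectors with 50,000 rows and about 0.1% nonzero entries.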
