Numpy cosine similarity difference over big collections

Time: 2016-09-07 09:27:39

Tags: python numpy scikit-learn cosine-similarity

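I need to use scikit-learn's sklearn.metrics.pairwise.cosine_similarity on large matrices. As an optimization I only need to compute some rows of the result, so I tried several different approaches.

I found that in some cases the results differ depending on the size of the vectors. The following test case (large vectors, transposed, then cosine similarity) shows this strange behaviour: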

from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial
import numpy as np
from scipy.sparse import csc_matrix

size=200
a=np.array([[1,0,1,0]]*size)
sparse_a=csc_matrix(a.T)
#standard cosine similarity between the whole transposed matrix, take only the first row
res1=cosine_similarity(a.T,a.T)[0]
#cosine similarity of the first row of the transposed matrix against the whole transposed matrix (computes only the first row of the result)
res2=cosine_similarity([a.T[0]],a.T)[0]
#sparse matrix implementation with the transposed, which should be faster
res3=cosine_similarity(sparse_a,sparse_a)[0]
print("res1: ",res1)
print("res2: ",res2)
print("res3: ",res3)
print("res1 vs res2: ",res1==res2)
print("res1 vs res3: ",res1==res3)
print("res2 vs res3: ", res2==res3)

With size set to 200 I get this result, which is fine:

res1:  [ 1.  0.  1.  0.]
res2:  [ 1.  0.  1.  0.]
res3:  [ 1.  0.  1.  0.]
res1 vs res2:  [ True  True  True  True]
res1 vs res3:  [ True  True  True  True]
res2 vs res3:  [ True  True  True  True]

But with size set to 2000 or more, something strange happens:

res1:  [ 1.  0.  1.  0.]
res2:  [ 1.  0.  1.  0.]
res3:  [ 1.  0.  1.  0.]
res1 vs res2:  [False  True False  True]
res1 vs res3:  [False  True False  True]
res2 vs res3:  [ True  True  True  True]

Does anyone know what I am missing?

Thanks in advance.

1 answer:

Answer 0 (score: 0)

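To compare floating-point numpy arrays you have to use np.isclose instead of the equality operator, because values computed along different code paths can differ by a tiny rounding error even though they print identically. Try: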

from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial
import numpy as np
from scipy.sparse import csc_matrix

size=2000
a=np.array([[1,0,1,0]]*size)
sparse_a=csc_matrix(a.T)
#standard cosine similarity between the whole transposed matrix, take only the first row
res1=cosine_similarity(a.T,a.T)[0]
#cosine similarity of the first row of the transposed matrix against the whole transposed matrix (computes only the first row of the result)
res2=cosine_similarity([a.T[0]],a.T)[0]
#sparse matrix implementation with the transposed, which should be faster
res3=cosine_similarity(sparse_a,sparse_a)[0]
print("res1: ",res1)
print("res2: ",res2)
print("res3: ",res3)
print("res1 vs res2: ", np.isclose(res1, res2))
print("res1 vs res3: ", np.isclose(res1, res3))
print("res2 vs res3: ", np.isclose(res2, res3))

The result is:

res1:  [ 1.  0.  1.  0.]
res2:  [ 1.  0.  1.  0.]
res3:  [ 1.  0.  1.  0.]
res1 vs res2:  [ True  True  True  True]
res1 vs res3:  [ True  True  True  True]
res2 vs res3:  [ True  True  True  True]

As expected.
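
To make the underlying effect easier to see in isolation, here is a minimal sketch that does not use scikit-learn at all. The arrays x and y are made up for illustration (they are not the res1/res2 arrays from the question); they print identically at numpy's default precision, yet one element differs in its last bits, which is enough to make == return False while np.isclose and np.allclose report a match.

```python
import numpy as np

# Hypothetical data: 0.1 + 0.2 and 0.3 are different doubles,
# although both are shown as 0.3 at the default print precision.
x = np.array([0.1 + 0.2, 0.0, 1.0])
y = np.array([0.3, 0.0, 1.0])

exact = x == y                    # element-wise exact comparison
close = np.isclose(x, y)          # element-wise comparison with a tolerance
all_close = np.allclose(x, y)     # one boolean for the whole array
max_diff = np.max(np.abs(x - y))  # size of the largest discrepancy

print("x:", x)
print("y:", y)
print("exact:", exact)
print("isclose:", close)
print("allclose:", all_close)
print("max abs diff:", max_diff)
```

Printing the maximum absolute difference between two results, as in the last line, is a quick way to check whether a mismatch is only rounding noise; np.isclose and np.allclose also accept rtol and atol arguments if the default tolerances do not suit the data.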