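I have several numeric datasets, each of length 22, where every number lies between 0 and 1 and represents the percentage of that attribute.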
[0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
[0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]
[0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0]
[0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0]
[0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0]
How can I compute the cosine similarity between two of these datasets using Python?
Answer 0 (score: 4)
By the definition of cosine similarity, you only need to compute the normalized dot product of the two vectors a and b:
import numpy as np
a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]
print(np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b))
Output:
0.115081383219
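For reuse, the same calculation can be wrapped in a small function. This is only a sketch: the name cosine_sim and the choice to return nan for an all-zero vector are my own assumptions, not part of the answer above. The guard is there because an all-zero vector has norm 0 and would otherwise cause a division by zero.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D sequences (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption: treat the similarity as undefined for a zero vector.
        return float("nan")
    return float(np.dot(a, b) / denom)

a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]
sim = cosine_sim(a, b)
print(sim)
```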
Answer 1 (score: 1)
If you do not want to depend on numpy, you can use
# zip pairs up the elements; the denominator is the product of the two norms
result = (sum(ax*bx for ax, bx in zip(a, b)) /
          (sum(ax**2 for ax in a) *
           sum(bx**2 for bx in b))**0.5)
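A pure-Python version written as a function might look like the sketch below. The function name and the decision to raise ValueError on mismatched lengths or zero vectors are assumptions of mine. The two points to get right are that the elements are paired with zip, and that the dot product is divided by the product of the two Euclidean norms.

```python
import math

def cosine_similarity_py(a, b):
    """Cosine similarity using only the standard library (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: refuse zero vectors, since the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity_py([1, 2, 3], [4, 5, 6]))
```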
Answer 2 (score: 0)
You can use the implementation that ships with sklearn directly:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(np.asmatrix([1,2,3]), np.asmatrix([4,5,6]))[0][0]
Output:
0.97463184619707621
Note: cosine_similarity operates on 2-D input (one row per sample). If you do not use np.asmatrix(), you will get the following warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
To get the final value as a scalar, index the output with [0][0].
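As the warning text suggests, a plain 2-D array works as well, which avoids np.asmatrix (the NumPy documentation no longer recommends the matrix class). The sketch below assumes scikit-learn is installed. The variable names X, sims and pair are my own. Stacking the five datasets from the question as rows gives all pairwise similarities in a single call.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One row per dataset from the question.
X = np.array([
    [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0],
    [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0],
    [0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0],
    [0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0],
    [0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0],
])

# With a single argument, every row is compared against every other row.
sims = cosine_similarity(X)
print(sims.shape)   # one row and one column per dataset
print(sims[0, 1])   # similarity between the first two datasets

# A single pair: reshape each 1-D vector into one row (one sample).
x = np.array([1, 2, 3]).reshape(1, -1)
y = np.array([4, 5, 6]).reshape(1, -1)
pair = cosine_similarity(x, y)[0, 0]
print(pair)
```

sims[i, j] is the similarity between dataset i and dataset j, so the matrix is symmetric and its diagonal is 1 for any non-zero row.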