Computing cosine similarity between columns in pandas

Time: 2016-06-20 11:21:12

Tags: python numpy pandas

I'm new to pandas, and I have these two Series.

train['description_1'] and train['description_2'] are Series. Each row of each one holds a vector.

from scipy.spatial.distance import cosine
item3 = pd.concat([train['description_1'], train['description_2']], axis = 1)
cos_vec = item3.apply(cosine)

The exception is TypeError: ('cosine() takes exactly 2 arguments (1 given)', u'occurred at index description_1')

Each element of train['description'] contains a vector.

I'm expecting something like this:

train_1       train_2
[1.0,2.0]     [2.0,3.0] 
[2.0,2.0]     [3.0,2.0] 


Output:

cos_sim 
x
y

1 answer:

Answer 0: (score: 3)

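You need to pass both columns to cosine and subtract the result from 1, because cosine returns a distance rather than a similarity: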

import pandas as pd
from scipy.spatial.distance import cosine

df = pd.DataFrame({'description_1':[0.1,0.32,0.3],
                   'description_2':[0.4,0.5,0.6]})


print (df)
   description_1  description_2
0           0.10            0.4
1           0.32            0.5
2           0.30            0.6

cos_vec = (1 - cosine(df["description_1"], df["description_2"]))
print (cos_vec)
0.962571458085
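
As a sanity check on the value above, the same similarity can be computed directly from the definition (dot product divided by the product of the norms). This is only a minimal sketch; the helper name cosine_similarity is made up for illustration and is not part of pandas or scipy.

```python
import numpy as np

def cosine_similarity(u, v):
    # Hypothetical helper: dot(u, v) / (|u| * |v|).
    # Assumes neither vector is all zeros (that would divide by zero).
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same numbers as the two columns in the example above
sim = cosine_similarity([0.1, 0.32, 0.3], [0.4, 0.5, 0.6])
print(round(sim, 6))  # 0.962571
```

This should agree with 1 - cosine(...) from scipy, since scipy's cosine is the cosine distance.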

Edit:

import pandas as pd
from scipy.spatial.distance import cosine

df = pd.DataFrame({'description_1':[[1.0,2.0],[2.0,2.0]],
                   'description_2':[[2.0,3.0],[3.0,2.0]]})


print (df)
  description_1 description_2
0    [1.0, 2.0]    [2.0, 3.0]
1    [2.0, 2.0]    [3.0, 2.0]

cos_vec = df.apply(lambda x: (1 - cosine(x["description_1"], x["description_2"])), axis=1)
print (cos_vec)
0    0.992278
1    0.980581
dtype: float64
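
If the frame is large, apply with axis=1 can be slow because it calls the lambda once per row. A possible alternative, sketched below, stacks each column into a 2-D NumPy array and computes all row-wise similarities at once. It assumes every vector in a column has the same length and that none is all zeros; the names a, b, dots and norms are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'description_1': [[1.0, 2.0], [2.0, 2.0]],
                   'description_2': [[2.0, 3.0], [3.0, 2.0]]})

# Stack the per-row lists into (n_rows, n_dims) arrays.
# Assumption: all vectors in a column share the same length.
a = np.array(df['description_1'].tolist(), dtype=float)
b = np.array(df['description_2'].tolist(), dtype=float)

# Row-wise dot products and row-wise norm products
dots = np.einsum('ij,ij->i', a, b)
norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)

# Assumption: no zero vectors, otherwise this divides by zero
cos_sim = pd.Series(dots / norms, index=df.index, name='cos_sim')
print(cos_sim)
```

For the two rows above this should give the same values as the apply version (about 0.992278 and 0.980581).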