I have data like this:
u_df = pd.Series({'a':[0,0.11,0.22],'b':[0.92,0.11,0.65],'c':[0.2,0.5,0.23]}).reset_index()
u_df.columns = ['key','value']
v_df = pd.Series({'g':[0.5,0.21,0.5],'f':[0.12,0.191,0.68],'e':[0.2,0.1,0.23]}).reset_index()
v_df.columns = ['key','value']
  key               value
0   a     [0, 0.11, 0.22]
1   b  [0.92, 0.11, 0.65]
2   c    [0.2, 0.5, 0.23]

  key                value
0   e     [0.2, 0.1, 0.23]
1   f  [0.12, 0.191, 0.68]
2   g     [0.5, 0.21, 0.5]
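I want to compute the cosine distance for the Cartesian product of these two DataFrames. Here is how I currently compute it between two lists (strictly speaking, this returns the cosine similarity):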
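import itertools
import numpy as np

def dot(K, L):
    # Dot product of two equal-length lists; returns 0 on a length mismatch
    if len(K) != len(L):
        return 0
    return sum(i[0] * i[1] for i in zip(K, L))

def similarity(item_1, item_2):
    # Cosine similarity: dot product divided by the product of the norms
    return dot(item_1, item_2) / np.sqrt(dot(item_1, item_1) * dot(item_2, item_2))

similarities = {item: similarity(target_features[item[0]], train_features[item[1]]) for item in itertools.product(target_features, train_features)}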
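As a quick sanity check, the sketch below runs the same functions on plain dictionaries built from the sample data. The names `u` and `v` are assumptions standing in for `target_features` and `train_features`, which the question does not define, and `math.sqrt` replaces `np.sqrt` only to keep the example free of dependencies.

```python
import itertools
import math

def dot(K, L):
    # Dot product; returns 0 when the lengths differ, as in the question
    if len(K) != len(L):
        return 0
    return sum(a * b for a, b in zip(K, L))

def similarity(item_1, item_2):
    # Cosine similarity of two lists
    return dot(item_1, item_2) / math.sqrt(dot(item_1, item_1) * dot(item_2, item_2))

# Hypothetical stand-ins for target_features / train_features
u = {'a': [0, 0.11, 0.22], 'b': [0.92, 0.11, 0.65], 'c': [0.2, 0.5, 0.23]}
v = {'e': [0.2, 0.1, 0.23], 'f': [0.12, 0.191, 0.68], 'g': [0.5, 0.21, 0.5]}

# One entry per (key1, key2) pair in the Cartesian product
similarities = {
    (k1, k2): similarity(u[k1], v[k2])
    for k1, k2 in itertools.product(u, v)
}

for pair, value in similarities.items():
    print(pair, round(value, 6))
```

The dictionary is keyed by `(key1, key2)` tuples, which is the same information as the three-column result requested in the question, just not yet in DataFrame form.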
But I would like to compute it directly from the DataFrames, and I want the final result to look like this:
  key1 key2        value
0    a    e  0.780720058
1    a    f  0.968164605
2    a    g  0.733602842
3    b    e  0.948870564
4    b    f  0.707152537
...
Answer 0 (score: 1)
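You can first do a cross join with merge, then use apply with cosine to get the result. scipy's cosine returns a distance, so it is subtracted from 1 to obtain the similarity:

from scipy.spatial.distance import cosine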
u_df['tmp'] = 1
v_df['tmp'] = 1
df = pd.merge(u_df, v_df, on='tmp', how='outer')
df['value'] = df.apply(lambda x: (1 - cosine(x["value_x"], x["value_y"])), axis=1)
df = df[['key_x','key_y','value']]
print(df)
  key_x key_y     value
0     a     e  0.780720
1     a     f  0.968165
2     a     g  0.733603
3     b     e  0.948871
4     b     f  0.707153
5     b     g  0.967946
6     c     e  0.760748
7     c     f  0.657643
8     c     g  0.740844
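For larger frames, the row-wise apply can become slow. A possible vectorized alternative, sketched here under the assumption that every list in the value column has the same length, stacks the lists into NumPy arrays and obtains all pairwise similarities from a single matrix product. The names `U`, `V`, `sim` and `result` are illustrative, and the frames are rebuilt explicitly so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built explicitly so row order is unambiguous
u_df = pd.DataFrame({'key': ['a', 'b', 'c'],
                     'value': [[0, 0.11, 0.22], [0.92, 0.11, 0.65], [0.2, 0.5, 0.23]]})
v_df = pd.DataFrame({'key': ['e', 'f', 'g'],
                     'value': [[0.2, 0.1, 0.23], [0.12, 0.191, 0.68], [0.5, 0.21, 0.5]]})

# Stack the list-valued cells into 2-D arrays (assumes equal-length vectors)
U = np.array(u_df['value'].tolist(), dtype=float)
V = np.array(v_df['value'].tolist(), dtype=float)

# Normalise each row to unit length; the matrix product of the normalised
# arrays then holds the cosine similarity of every (u row, v row) pair
U_unit = U / np.linalg.norm(U, axis=1, keepdims=True)
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
sim = U_unit @ V_unit.T  # shape: (len(u_df), len(v_df))

# Flatten to long format; row-major ravel matches the from_product ordering
index = pd.MultiIndex.from_product([u_df['key'], v_df['key']],
                                   names=['key1', 'key2'])
result = pd.DataFrame({'value': sim.ravel()}, index=index).reset_index()
print(result)
```

If you prefer to stay with merge, pandas 1.2 and later accept how='cross', which removes the need for the temporary tmp column.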