Computing the cosine distance over the Cartesian product of two DataFrames

Time: 2017-03-13 11:44:05

Tags: python pandas dataframe

I have data like this:

u_df = pd.Series({'a':[0,0.11,0.22],'b':[0.92,0.11,0.65],'c':[0.2,0.5,0.23]}).reset_index()
u_df.columns = ['key','value']
v_df = pd.Series({'g':[0.5,0.21,0.5],'f':[0.12,0.191,0.68],'e':[0.2,0.1,0.23]}).reset_index()
v_df.columns = ['key','value']

    key        value
0   a     [0, 0.11, 0.22]
1   b  [0.92, 0.11, 0.65]
2   c    [0.2, 0.5, 0.23]

    key         value
0   e     [0.2, 0.1, 0.23]
1   f  [0.12, 0.191, 0.68]
2   g     [0.5, 0.21, 0.5]

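I want to compute the cosine distance for every pair in the Cartesian product of these two DataFrames. This is how I currently compute it for two lists (the function returns the cosine similarity):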

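import itertools
import numpy as np

def dot(K, L):
    # Dot product of two equal-length lists; returns 0 if the lengths differ.
    if len(K) != len(L):
        return 0
    return sum(i[0] * i[1] for i in zip(K, L))

def similarity(item_1, item_2):
    # Cosine similarity: dot product divided by the product of the two norms.
    return dot(item_1, item_2) / np.sqrt(dot(item_1, item_1) * dot(item_2, item_2))

# target_features and train_features are assumed to be dicts mapping each key to its vector.
similarities = {item: similarity(target_features[item[0]], train_features[item[1]]) for item in itertools.product(target_features, train_features)}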
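For reference, the same calculation can be written with NumPy and run on the sample data. This is only a sketch: the question never defines `target_features` and `train_features`, so they are assumed here to be plain dicts built from the vectors shown above, and `cosine_similarity` is a name made up for illustration.

```python
import itertools
import numpy as np

# Assumed inputs (not defined in the question): dicts mapping each key to its vector.
target_features = {'a': [0, 0.11, 0.22], 'b': [0.92, 0.11, 0.65], 'c': [0.2, 0.5, 0.23]}
train_features = {'e': [0.2, 0.1, 0.23], 'f': [0.12, 0.191, 0.68], 'g': [0.5, 0.21, 0.5]}

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One entry per (key1, key2) pair in the Cartesian product of the two key sets.
similarities = {
    (k1, k2): cosine_similarity(target_features[k1], train_features[k2])
    for k1, k2 in itertools.product(target_features, train_features)
}
```

The resulting dict has nine entries keyed by tuples such as `('a', 'e')`.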

But I would like to compute it directly from the DataFrames, and I want the final result to look like this:

    key1   key2      value
0   a       e      0.780720058
1   a       f      0.968164605
2   a       g      0.733602842
3   b       e      0.948870564
4   b       f      0.707152537
……

1 answer:

Answer 0 (score: 1)

You can first use merge to do a cross join, then use apply to compute the cosine distance for each row. Note that scipy's cosine returns the distance, so 1 - cosine(...) gives the similarity values shown in the desired output:

from scipy.spatial.distance import cosine

u_df['tmp'] = 1
v_df['tmp'] = 1

df = pd.merge(u_df, v_df, on='tmp', how='outer')
df['value'] = df.apply(lambda x: (1 - cosine(x["value_x"], x["value_y"])), axis=1)
df = df[['key_x','key_y','value']]
print (df)

  key_x key_y     value
0     a     e  0.780720
1     a     f  0.968165
2     a     g  0.733603
3     b     e  0.948871
4     b     f  0.707153
5     b     g  0.967946
6     c     e  0.760748
7     c     f  0.657643
8     c     g  0.740844
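As an alternative to the row-wise apply, the pairwise values can also be computed in one call with `scipy.spatial.distance.cdist`. The following is a sketch rather than part of the original answer: it assumes every vector has the same length, rebuilds the sample frames with `pd.DataFrame`, and uses the column names `key1` and `key2` from the question's desired output.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data from the question, rebuilt as two small DataFrames.
u_df = pd.DataFrame({'key': ['a', 'b', 'c'],
                     'value': [[0, 0.11, 0.22], [0.92, 0.11, 0.65], [0.2, 0.5, 0.23]]})
v_df = pd.DataFrame({'key': ['e', 'f', 'g'],
                     'value': [[0.2, 0.1, 0.23], [0.12, 0.191, 0.68], [0.5, 0.21, 0.5]]})

# Stack the list-valued column into 2-D arrays, one row per key.
U = np.array(u_df['value'].tolist(), dtype=float)
V = np.array(v_df['value'].tolist(), dtype=float)

# cdist returns the cosine distance; subtract from 1 to get the similarity.
sim = 1 - cdist(U, V, metric='cosine')

# from_product iterates the second level fastest, matching the row-major ravel().
index = pd.MultiIndex.from_product([u_df['key'], v_df['key']], names=['key1', 'key2'])
result = pd.DataFrame({'value': sim.ravel()}, index=index).reset_index()
```

Here `result` has the columns `key1`, `key2` and `value`, one row per key pair. If the merge-based approach is preferred, pandas 1.2 and later also accept `pd.merge(u_df, v_df, how='cross')`, which avoids the temporary `tmp` column.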
