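Cosine similarity between rows of a pandas DataFrame

Time: 2020-12-22 01:15:38

Tags: python-3.x pandas dataframe cosine-similarity

I have a CSV file with the contents shown in the table below. For each ID, I want to compute its cosine similarity against the remaining IDs in the file.

I have loaded it into a pandas DataFrame and computed the similarity matrix as follows: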

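    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Convert each Vector cell into a flat 1-D array
    old_df['Vector'] = old_df.apply(
        lambda row: np.array(np.matrix(row.Vector)).ravel(), axis=1)
    # Stack the per-row arrays into one (n_rows, n_dims) matrix
    A = np.array(old_df['Vector'].tolist())
    similarities = cosine_similarity(A)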

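The output looks fine. However, I don't know how to find which GUIDs (or IDs) are similar to a given GUID (or ID). I only want the top k with the highest similarity scores.

Could you help me with this?

Thanks.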

|Index  |  GUID | Vector                                |
|-------|-------|---------------------------------------|
|36099  | b770  |[-0.04870541 -0.02133574  0.03180726]  |
|36098  | 808f  |[  0.0732905  -0.05331331  0.06378368] |
|36097  | b111  |[ 0.01994788  0.00417582 -0.09615131]  |
|36096  | b6b5  |[0.025697   -0.08277534 -0.0124591]    |
|36083  | 9b07  |[ 0.025697   -0.08277534 -0.0124591]   |
|36082  | b9ed  |[-0.00952298  0.06188576 -0.02636449]  |
|36081  | a5b6  |[0.00432161  0.02264584 -0.0341924]    |
|36080  | 9891  |[ 0.08732156  0.00649456 -0.02014138]  |
|36079  | ba40  |[0.05407356 -0.09085554 -0.07671648]   |
|36078  | 9dff  |[-0.09859556  0.04498474 -0.01839088]  |
|36077  | a423  |[-0.06124249  0.06774347 -0.05234318]  |
|36076  | 81c4  |[0.07278682 -0.10460124 -0.06572364]   |
|36075  | 9f88  |[0.09830415  0.05489364 -0.03916228]   |
|36074  | adb8  |[0.03149953 -0.00486591  0.01380711]   |
|36073  | 9765  |[0.00673934  0.0513557  -0.09584251]   |
|36072  | aff4  |[-0.00097896  0.0022945   0.01643319]  |

1 answer:

Answer 0: (score: 1)

Sample code that gets the top k cosine similarities together with the corresponding GUID pairs and row indexes:

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
    df = pd.DataFrame(data)
    print("Data: \n{}\n".format(df))

    # Stack the per-row vectors into one (n_rows, n_dims) matrix
    A = np.array(df['Vector'].tolist())
    vectors_num = len(A)
    # Get similarities matrix
    similarities = cosine_similarity(A)
    # Mask the diagonal and lower triangle with -2 (below any cosine value)
    # so that self-matches are ignored and each pair is counted only once
    similarities[np.tril_indices(vectors_num)] = -2
    print("Similarities: \n{}\n".format(similarities))

    k = 2
    # Only n * (n - 1) / 2 distinct pairs exist, so clamp k to that
    max_pairs = vectors_num * (vectors_num - 1) // 2
    if k > max_pairs:
        k = max_pairs
    # Get top k similarities and pair GUID in ascending order
    top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
    top_k_similarities = similarities[top_k_indexes]
    top_k_pair_GUID = []
    # top_k_indexes is (row_indexes, column_indexes); zip them into (i, j) pairs
    for i, j in zip(*top_k_indexes):
        pair_GUID = (df.iloc[i]["GUID"], df.iloc[j]["GUID"])
        top_k_pair_GUID.append(pair_GUID)

    print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))

Output:

    Data:
       GUID             Vector
    0  b770  [-0.1, -0.2, 0.3]
    1  808f  [0.1, -0.2, -0.3]
    2  b111  [-0.1, 0.2, -0.3]

    Similarities:
    [[-2.         -0.42857143 -0.85714286]
     [-2.         -2.          0.28571429]
     [-2.         -2.         -2.        ]]

    top_k_indexes:
    (array([0, 1], dtype=int64), array([1, 2], dtype=int64))
    top_k_pair_GUID:
    [('b770', '808f'), ('808f', 'b111')]
    top_k_similarities:
    [-0.42857143  0.28571429]
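
The code above ranks GUID pairs across the whole matrix. If what you need is the top k most similar GUIDs for each individual GUID, which is how I read the question, a per-row variant could look roughly like the sketch below. The helper name `top_k_per_guid` and the output column names `similar_GUID`, `similarity` and `rank` are made up for illustration, and the sketch assumes the `Vector` column already holds numeric lists or arrays of equal length. The sample rows are the first five from the question's table.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_k_per_guid(df, k, guid_col="GUID", vector_col="Vector"):
    """Hypothetical helper: the k most similar GUIDs for every GUID."""
    guids = df[guid_col].to_numpy()
    A = np.array(df[vector_col].tolist(), dtype=float)
    sims = cosine_similarity(A)
    # Exclude self-matches so a GUID is never its own nearest neighbour
    np.fill_diagonal(sims, -np.inf)
    # Each GUID has at most n - 1 neighbours
    k = min(k, len(df) - 1)
    # For each row, column indexes sorted by descending similarity, first k kept
    order = np.argsort(-sims, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(df)), k)
    cols = order.ravel()
    return pd.DataFrame({
        "GUID": guids[rows],
        "similar_GUID": guids[cols],
        "similarity": sims[rows, cols],
        "rank": np.tile(np.arange(1, k + 1), len(df)),
    })


df = pd.DataFrame({
    "GUID": ["b770", "808f", "b111", "b6b5", "9b07"],
    "Vector": [
        [-0.04870541, -0.02133574, 0.03180726],
        [0.0732905, -0.05331331, 0.06378368],
        [0.01994788, 0.00417582, -0.09615131],
        [0.025697, -0.08277534, -0.0124591],
        [0.025697, -0.08277534, -0.0124591],
    ],
})
result = top_k_per_guid(df, k=2)
print(result)
```

Since `b6b5` and `9b07` have identical vectors in the question's data, each should show up as the other's rank 1 neighbour with a similarity of about 1. To look up a single GUID, filter the result, e.g. `result[result["GUID"] == "b770"]`. The full sort per row is fine for a few thousand rows; for much larger data, `np.argpartition` might be worth a look to avoid sorting every row completely.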