Find the closest matching row

Time: 2018-09-19 12:05:21

Tags: python comparison distance

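I am building a distance matrix that compares the rows of one dataframe with the rows of another dataframe to find the closest match. My code works, but with a large amount of data I want to compare each row only with the columns whose names belong to the same group, and get the minimum value together with the name of the column where it occurs.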

Example: I want to match the a rows against the a columns, the b rows against the b columns, and so on.

+----+----+----+----+----+----+----+----+
| id | a1 | a2 | b3 | b4 | b5 | c6 | c7 |
+----+----+----+----+----+----+----+----+
| a1 |  0 |  8 |  3 | 10 |  2 |  6 |  3 |
| a2 |  0 |  8 |  9 |  1 |  6 |  4 |  2 |
| a3 | 10 |  1 |  1 |  2 |  0 |  7 |  6 |
| b4 |  4 |  6 |  7 |  7 |  9 |  1 | 10 |
| b5 | 10 |  1 | 10 |  0 |  2 |  5 |  4 |
| c6 |  9 |  2 |  0 |  8 |  5 |  4 |  3 |
| c7 |  1 |  9 |  5 | 10 |  0 |  8 |  9 |
| c8 |  7 |  2 |  8 |  3 |  5 |  3 |  6 |
+----+----+----+----+----+----+----+----+

so that the comparison is made within each group, for example:
+----+----+----+
|    | a1 | a2 |
+----+----+----+
| a1 |  0 |  8 |
| a2 |  0 |  8 |
| a3 | 10 |  1 |
+----+----+----+

and the output would be:

+----+-----------+----------+
| id | min_score | col_name |
+----+-----------+----------+
| a1 |         0 | a1       |
| a2 |         0 | a1       |
| a3 |         1 | a2       |
| b4 |         7 | b3,b4    |
| b5 |         0 | b4       |
| c6 |         3 | c7       |
| c7 |         8 | c6       |
| c8 |         3 | c6       |
+----+-----------+----------+

This is my code, but it does not give the output I need:

import numpy as np
import pandas as pd
import scipy.spatial.distance

mat = scipy.spatial.distance.cdist(df[['team1', 'team2', 'team3']],
                                   df1[['team1', 'team2', 'team3']],
                                   metric='jaccard')
new_df = pd.DataFrame(mat, index=df['id'], columns=df1['id'])
closest = np.where(a.eq(a[a != 0].min(), 0), df.columns, False)
# Store the array values in a variable
arr = new_df.values
arr[np.diag_indices_from(new_df)] = np.nan
# Replace the non-NaN minimum with the column name, otherwise with False
new_close = np.where(arr == np.nanmin(arr, axis=1)[:, None], new_df.columns, False)
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]

Any help? Thanks in advance.

1 answer:

Answer 0 (score: 0)

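I recreated the dataset with the code below (always include a way to build the dataframe in your question; it makes answering faster). The values are random, so they differ from the table in the question: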

import pandas as pd
import numpy as np

columns = ['a1','a2', 'b3', 'b4','b5', 'c6', 'c7']
index   = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']
data    = np.random.randint(0, 11, size=(len(index), len(columns)))

df = pd.DataFrame(index=index, columns=columns, data=data)
print(df)

    a1  a2  b3  b4  b5  c6  c7
a1   8   7   0   9   7   6   8
a2   3   5   4   7   3   6   9
a3   3   3  10  10   7   6   7
b4   2   7   4   5   6   7   2
b5   5   8   8   5   1   2  10
c6   6  10   9   1   0   9   5
c7   1  10   6   4   9   1   2
c8   9   7   4   8   4   3  10
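
Because `np.random.randint` is called without a seed, each run prints different numbers. A minimal sketch of a reproducible variant, assuming NumPy's `Generator` API (NumPy 1.17 or later) is available; the seed `0` and the helper name `make_frame` are arbitrary choices for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd

columns = ['a1', 'a2', 'b3', 'b4', 'b5', 'c6', 'c7']
index = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']


def make_frame(seed=0):
    """Hypothetical helper: build the random test frame from a fixed seed."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so the values fall in 0..10
    data = rng.integers(0, 11, size=(len(index), len(columns)))
    return pd.DataFrame(data, index=index, columns=columns)


df = make_frame()
print(df)
```

Calling `make_frame` twice with the same seed returns identical frames, which makes it easier for others to reproduce a result.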

The following code then builds a dataframe `new_df` with the columns 'min_score' and 'col_name':

new_df = pd.DataFrame(index=df.index, columns=['min_score', 'col_name'])
for character in df.columns.str[:1].unique():
    # select the rows and columns whose label starts with this character
    columns_char = df.columns[df.columns.str.startswith(character)]
    index_char   = df.index[df.index.str.startswith(character)]
    df_char = df.loc[index_char, columns_char]

    # row-wise minimum within the subset
    df_char_min = df_char.min(axis=1)
    # a single .loc call avoids chained assignment
    new_df.loc[df_char.index, 'min_score'] = df_char_min
    # comma-joined names of every column equal to the row minimum (keeps ties)
    is_min = df_char.eq(df_char_min, axis=0)
    new_df.loc[df_char.index, 'col_name'] = is_min.apply(lambda x: ','.join(x.index[x]), axis=1)

print(new_df)

    min_score   col_name
a1  7.0         a2
a2  3.0         a1
a3  3.0         a1,a2
b4  4.0         b3
b5  1.0         b5
c6  5.0         c7
c7  1.0         c6
c8  3.0         c6
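
The same grouping can also be done without a loop. The following is a sketch rather than a drop-in replacement: it assumes, as the loop above does, that the first character of each label identifies its group, and it uses the table from the question so the result can be compared with the requested output. The names `same_group`, `masked` and `result` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

columns = ['a1', 'a2', 'b3', 'b4', 'b5', 'c6', 'c7']
index = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']
data = [
    [0, 8, 3, 10, 2, 6, 3],
    [0, 8, 9, 1, 6, 4, 2],
    [10, 1, 1, 2, 0, 7, 6],
    [4, 6, 7, 7, 9, 1, 10],
    [10, 1, 10, 0, 2, 5, 4],
    [9, 2, 0, 8, 5, 4, 3],
    [1, 9, 5, 10, 0, 8, 9],
    [7, 2, 8, 3, 5, 3, 6],
]
df = pd.DataFrame(data, index=index, columns=columns)

# True where the row label and the column label share a first character
row_group = df.index.str[:1].to_numpy()
col_group = df.columns.str[:1].to_numpy()
same_group = row_group[:, None] == col_group[None, :]

# Hide every cell outside the row's own group (those cells become NaN)
masked = df.where(same_group)

# Row-wise minimum over the remaining cells (NaN is skipped by default)
min_score = masked.min(axis=1)

# Names of all columns that attain the minimum, comma-joined to keep ties
is_min = masked.eq(min_score, axis=0)
col_name = is_min.apply(lambda row: ','.join(row.index[row.to_numpy()]), axis=1)

# Every row here has at least one same-group column, so no NaN remains
# and the cast back to int is safe for this table
result = pd.DataFrame({'min_score': min_score.astype(int), 'col_name': col_name})
print(result)
```

For the question's table this gives the `min_score` and `col_name` values listed in the requested output, including the tie `b3,b4` for row `b4`. If a row had no column in its own group, its minimum would be NaN and the integer cast would need to be dropped.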