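I am building a distance matrix that compares the rows of one data frame with those of another to find the closest match. My code works as it stands, but with a large amount of data I want to compare each row index only with the columns that have a similar name, and get the minimum value together with its column name.
Example: I want to match the a series with the a series, the b series with the b series, and so on.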
+----+----+----+----+----+----+----+----+
| id | a1 | a2 | b3 | b4 | b5 | c6 | c7 |
+----+----+----+----+----+----+----+----+
| a1 | 0 | 8 | 3 | 10 | 2 | 6 | 3 |
| a2 | 0 | 8 | 9 | 1 | 6 | 4 | 2 |
| a3 | 10 | 1 | 1 | 2 | 0 | 7 | 6 |
| b4 | 4 | 6 | 7 | 7 | 9 | 1 | 10 |
| b5 | 10 | 1 | 10 | 0 | 2 | 5 | 4 |
| c6 | 9 | 2 | 0 | 8 | 5 | 4 | 3 |
| c7 | 1 | 9 | 5 | 10 | 0 | 8 | 9 |
| c8 | 7 | 2 | 8 | 3 | 5 | 3 | 6 |
+----+----+----+----+----+----+----+----+
so that the comparison is done on a block like this:
+----+----+----+
| | a1 | a2 |
+----+----+----+
| a1 | 0 | 8 |
| a2 | 0 | 8 |
| a3 | 10 | 1 |
+----+----+----+
and the output would be:
+----+-----------+----------+
| id | min_score | col_name |
+----+-----------+----------+
| a1 | 0 | a1 |
| a2 | 0 | a1 |
| a3 | 1 | a2 |
| b4 | 7 | b3,b4 |
| b5 | 0 | b4 |
| c6 | 3 | c7 |
| c7 | 8 | c6 |
| c8 | 3 | c6 |
+----+-----------+----------+
Here is my code, but it does not give me the output I need:
mat = scipy.spatial.distance.cdist(df[['team1','team2','team3']],
df1[['team1','team2','team3']],
metric='jaccard')
new_df = pd.DataFrame(mat, index=df['id'], columns=df1['id'])
closest = np.where(a.eq(a[a != 0].min(),0),df.columns,False)
# Store the array values in a variable
arr = new_df.values
arr[np.diag_indices_from(new_df)] = np.nan
#Replace the non nan min with column name and otherwise with false
new_close = np.where(arr == np.nanmin(arr, axis=1)[:,None],new_df.columns,False)
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]
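Any help? Thanks in advance.
Answer 0 (score: 0)
I recreated the dataset with the following code (always post a way to build the data frame in your question; it makes answering much faster):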
import pandas as pd
import numpy as np
columns = ['a1','a2', 'b3', 'b4','b5', 'c6', 'c7']
index = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']
data = np.random.randint(0, 11, size=(len(index), len(columns)))
df = pd.DataFrame(index=index, columns=columns, data=data)
print(df)
a1 a2 b3 b4 b5 c6 c7
a1 8 7 0 9 7 6 8
a2 3 5 4 7 3 6 9
a3 3 3 10 10 7 6 7
b4 2 7 4 5 6 7 2
b5 5 8 8 5 1 2 10
c6 6 10 9 1 0 9 5
c7 1 10 6 4 9 1 2
c8 9 7 4 8 4 3 10
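Because the data above comes from an unseeded random generator, a new run will print different values. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` is available; the seed `0` and the name `df_seeded` are arbitrary choices for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

columns = ['a1', 'a2', 'b3', 'b4', 'b5', 'c6', 'c7']
index = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']

# The seed value 0 is an arbitrary assumption; any fixed seed gives repeatable data
rng = np.random.default_rng(0)
# integers() excludes the upper bound by default, so values fall in 0..10
data = rng.integers(0, 11, size=(len(index), len(columns)))
df_seeded = pd.DataFrame(data, index=index, columns=columns)
print(df_seeded)
```

With a fixed seed, anyone rerunning the snippet gets the same frame, which makes posted output easier to check.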
I then used the following code to get the data frame 'new_df' with the columns 'min_score' and 'col_name'.
new_df = pd.DataFrame(index=df.index, columns=['min_score', 'col_name'])
for character in df.columns.str[:1].unique():
    # build a sub-frame of the rows and columns that share this prefix
    columns_char = df.columns[df.columns.str.startswith(character)]
    index_char = df.index[df.index.str.startswith(character)]
    df_char = df.loc[index_char, columns_char]
    # find the row-wise minimum and the column name(s) that hold it
    df_char_min = df_char.min(axis=1)
    is_min = df_char.eq(df_char_min, axis=0)
    # a single .loc call avoids chained assignment, which can write to a copy
    new_df.loc[df_char.index, 'min_score'] = df_char_min
    new_df.loc[df_char.index, 'col_name'] = is_min.apply(lambda x: ','.join(x.index[x]), axis=1)
print(new_df)
min_score col_name
a1 7.0 a2
a2 3.0 a1
a3 3.0 a1,a2
b4 4.0 b3
b5 1.0 b5
c6 5.0 c7
c7 1.0 c6
c8 3.0 c6
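For larger frames, the per-prefix loop can probably be replaced by a single mask. The following is only a sketch, applied to the table from the question; it assumes that a row and a column belong to the same series when their labels start with the same character, and the function name `min_within_prefix` is made up for this example:

```python
import pandas as pd

columns = ['a1', 'a2', 'b3', 'b4', 'b5', 'c6', 'c7']
index = ['a1', 'a2', 'a3', 'b4', 'b5', 'c6', 'c7', 'c8']
# Values copied from the table in the question
data = [
    [0, 8, 3, 10, 2, 6, 3],
    [0, 8, 9, 1, 6, 4, 2],
    [10, 1, 1, 2, 0, 7, 6],
    [4, 6, 7, 7, 9, 1, 10],
    [10, 1, 10, 0, 2, 5, 4],
    [9, 2, 0, 8, 5, 4, 3],
    [1, 9, 5, 10, 0, 8, 9],
    [7, 2, 8, 3, 5, 3, 6],
]
df = pd.DataFrame(data, index=index, columns=columns)


def min_within_prefix(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-wise minimum restricted to columns sharing the row's first character."""
    row_prefix = frame.index.str[0].to_numpy()
    col_prefix = frame.columns.str[0].to_numpy()
    # Boolean matrix: True where the row label and column label share a prefix
    same_group = row_prefix[:, None] == col_prefix[None, :]
    # Cells outside the row's own series become NaN and are skipped by min()
    masked = frame.where(same_group)
    min_score = masked.min(axis=1)
    # NaN never compares equal, so only in-series columns can match
    is_min = masked.eq(min_score, axis=0)
    col_name = is_min.apply(
        lambda row: ','.join(row.index[row.to_numpy()]), axis=1
    )
    return pd.DataFrame({'min_score': min_score, 'col_name': col_name})


result = min_within_prefix(df)
print(result)
```

On the question's table this gives the rows listed in the expected output, for example 7 with `b3,b4` for `b4`. The scores come back as floats because masking introduces NaN; a row with no same-prefix column would keep NaN as its score and an empty `col_name`.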