Suppose I have two DataFrames, XA and XB, each with, say, 3 rows and 2 columns:
import pandas as pd

XA = pd.DataFrame({
    'x1': [1, 2, 3],
    'x2': [4, 5, 6]
})
XB = pd.DataFrame({
    'x1': [8, 7, 6],
    'x2': [5, 4, 3]
})
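For each record in XA, I want to find the nearest record in XB (for example, by Euclidean distance), along with the corresponding distance. For example, this could return a DataFrame indexed by id_A, with columns id_B and distance.

What is the most efficient way to do this?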
Answer 0 (score: 1)
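One approach is to compute the full distance matrix, melt it into long format, and then aggregate with nsmallest, which returns the index along with the value: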
import pandas as pd
from scipy.spatial.distance import cdist

def nearest_record(XA, XB):
    """Get the nearest record in XB for each record in XA.
    Args:
        XA: DataFrame. Each record is matched against the nearest in XB.
        XB: DataFrame.
    Returns:
        DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
        Each id_A maps to a single id_B, which is the nearest record from XB.
    """
    # Full distance matrix (rows: XA, columns: XB), melted to one row per pair.
    dist = pd.DataFrame(cdist(XA, XB)).reset_index().melt('index')
    dist.columns = ['id_A', 'id_B', 'dist']
    # id_B is sometimes returned as an object.
    dist['id_B'] = dist.id_B.astype(int)
    dist.reset_index(drop=True, inplace=True)
    # Keep the smallest distance per id_A; level_1 is the row label in dist.
    nearest = dist.groupby('id_A').dist.nsmallest(1).reset_index()
    # Join on that row label to recover the matching id_B.
    return nearest.set_index('level_1').join(dist.id_B).reset_index(drop=True)
This shows that id_B 2 is the nearest record for each of the three records in XA:
nearest_record(XA, XB)
   id_A      dist  id_B
0     0  5.099020     2
1     1  4.472136     2
2     2  4.242641     2
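However, because this computes the full distance matrix, it becomes slow or fails when XA and XB are large. An alternative that finds the nearest record one row at a time may be faster.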
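One way to limit memory without going fully row by row is to process XA in blocks, so that only a block-sized slice of the distance matrix exists at any time. The following is a minimal sketch rather than a tuned implementation. The function name nearest_record_chunked and the chunk_size parameter are hypothetical names introduced here, and the sketch assumes that XA is non-empty and that XA and XB have the same columns in the same order.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def nearest_record_chunked(XA, XB, chunk_size=1000):
    """For each row of XA, find the nearest row of XB, one block of XA at a time.

    Hypothetical helper: assumes XA is non-empty and that XA and XB share
    the same columns in the same order.
    """
    XB_values = XB.to_numpy()
    XB_labels = XB.index.to_numpy()
    parts = []
    for start in range(0, len(XA), chunk_size):
        block = XA.iloc[start:start + chunk_size]
        # Distance matrix for this block only: shape (len(block), len(XB)).
        d = cdist(block.to_numpy(), XB_values)
        # Position of the closest XB row for each row in the block.
        pos = d.argmin(axis=1)
        parts.append(pd.DataFrame({
            'id_A': block.index.to_numpy(),
            'id_B': XB_labels[pos],
            'dist': d[np.arange(len(block)), pos],
        }))
    return pd.concat(parts, ignore_index=True)


XA = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
XB = pd.DataFrame({'x1': [8, 7, 6], 'x2': [5, 4, 3]})
# chunk_size=2 forces two blocks on this tiny example.
result = nearest_record_chunked(XA, XB, chunk_size=2)
```

Peak memory then scales with chunk_size times the number of rows in XB instead of with the full matrix, while the total number of distance computations stays the same.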
Answer 1 (score: 0)
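Adapting this answer to avoid the full distance matrix, you can find the nearest record and its distance for a single row of XA (nearest_record1()), then use apply to run that function on every row (nearest_record()). In a test, this reduced the running time by about 85%.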
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def nearest_record1(XA1, XB):
    """Get the record in XB nearest to a single record XA1.
    Args:
        XA1: Series. A single row of XA.
        XB: DataFrame.
    Returns:
        Series with dist and id_B (position of the nearest row in XB).
    """
    # cdist needs 2-D input, so reshape the row to shape (1, n_columns).
    dist = cdist(XA1.values.reshape(1, -1), XB)[0]
    return pd.Series({'dist': np.amin(dist), 'id_B': np.argmin(dist)})

def nearest_record(XA, XB):
    """Get the nearest record in XB for each record in XA.
    Args:
        XA: DataFrame. Each record is matched against the nearest in XB.
        XB: DataFrame.
    Returns:
        DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
        Each id_A maps to a single id_B, which is the nearest record from XB.
    """
    res = XA.apply(lambda x: nearest_record1(x, XB), axis=1)
    res['id_A'] = XA.index
    # id_B can come back with a non-integer dtype, so cast it to int.
    res['id_B'] = res.id_B.astype(int)
    # Reorder columns.
    return res[['id_A', 'id_B', 'dist']]
This also returns the correct result:
nearest_record(XA, XB)
   id_A  id_B      dist
0     0     2  5.099020
1     1     2  4.472136
2     2     2  4.242641
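For larger inputs, a spatial index is another way to avoid the full distance matrix. This is a different technique from the apply approach above, not part of it: the sketch below builds a k-d tree over XB with scipy.spatial.cKDTree and queries it with every row of XA. The function name nearest_record_kdtree is hypothetical, and the sketch assumes Euclidean distance, which is the default metric of cKDTree.query, and matching column order in XA and XB.

```python
import pandas as pd
from scipy.spatial import cKDTree


def nearest_record_kdtree(XA, XB):
    """For each row of XA, find the nearest row of XB using a k-d tree.

    Hypothetical helper: assumes Euclidean distance and that XA and XB
    share the same columns in the same order.
    """
    # Build the tree once over XB.
    tree = cKDTree(XB.to_numpy())
    # k=1 returns, per XA row, the distance to and position of its nearest XB row.
    dist, pos = tree.query(XA.to_numpy(), k=1)
    return pd.DataFrame({
        'id_A': XA.index.to_numpy(),
        'id_B': XB.index.to_numpy()[pos],
        'dist': dist,
    })


XA = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
XB = pd.DataFrame({'x1': [8, 7, 6], 'x2': [5, 4, 3]})
result = nearest_record_kdtree(XA, XB)
```

k-d trees tend to help most when the number of columns is small; with many columns the advantage over brute force usually shrinks, so it is worth timing both on your own data.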