Question

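I have a dataset of stores, each with a 2D location and daily timestamps. I am trying to match each row with weather measurements taken at stations at some other locations, which also have daily timestamps, so that the Cartesian distance between each store and its matched station is minimized. Weather measurements are not taken every day, and station locations may vary, so the problem is to find the nearest station for each specific store on each specific date.

I realize I could write nested loops to do the matching, but I wonder whether anyone here can think of a neat way to do it with pandas dataframe operations. A toy example dataset is shown below. For simplicity, it has static weather station locations.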

import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})

The data below is the desired result. I include station_id only for clarification.

   store_id  date  station_id  weather
0         1     1           1       20
1         1     2           1       21
2         1     3           1       19
3         2     1           2       17
4         2     2           3       19
5         2     3           2       16
6         3     1           3       18
7         3     2           3       19
8         3     3           3       17

Answer 1

import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2-x1)**2 + (y2-y1)**2)

#Join On Date to get all combinations of store and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])

#Apply distance formula to each combination (the _x columns come from store_df, the _y columns from weather_station_df)
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])

#Get Minimum distance for each day Per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()

#Use resulting minimums to get the station_id matching the min distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')

#filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])

Edit: switched to a vectorized distance formula.

Answer 2

The idea of this solution is to build a table of all store/station combinations for each date,

df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))

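compute the squared distance (it is minimized by the same station as the distance itself, so the square root can be skipped),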

df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2

and pick the minimum in each group:

df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
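The selection step can also be written without apply. The following is only a sketch of that variant, run on the toy data from the question; the names nearest_idx and result are not from the original answer and are introduced here for illustration.

```python
import pandas as pd

# Toy data copied from the question.
store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})

# All store/station pairs that share a date.
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))

# Squared Euclidean distance between each store and station.
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2

# For each (store_id, date) group, idxmin gives the row label of the smallest dist.
nearest_idx = df.groupby(['store_id', 'date'])['dist'].idxmin()

# Look those rows up directly instead of calling apply on every group.
result = (df.loc[nearest_idx, ['store_id', 'date', 'station_id', 'weather']]
            .sort_values(['store_id', 'date'])
            .reset_index(drop=True))
print(result)
```

If two stations are equally close, idxmin keeps the first one it encounters, so each store/date pair yields exactly one row.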

If you have many dates, you can do the join group by group.
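One way to read that suggestion is to process one date at a time, so that only a single day's store/station combinations are held in memory at once. This is a rough sketch under that assumption; the function name nearest_station_per_date is hypothetical and does not appear in the original answer.

```python
import pandas as pd

def nearest_station_per_date(store_df, weather_station_df):
    """Match each store to its nearest station, one date at a time (hypothetical helper)."""
    # Split the station measurements by date once, up front.
    stations_by_date = {date: group for date, group in weather_station_df.groupby('date')}

    parts = []
    for date, stores in store_df.groupby('date'):
        stations = stations_by_date.get(date)
        if stations is None:
            continue  # no weather measurements on this date, so nothing to match

        # All store/station pairs for this single date only.
        pairs = stores.merge(stations, on='date', suffixes=('_store', '_station'))
        pairs['dist'] = ((pairs.x_store - pairs.x_station)**2
                         + (pairs.y_store - pairs.y_station)**2)

        # Keep the closest station for each store on this date.
        idx = pairs.groupby('store_id')['dist'].idxmin()
        parts.append(pairs.loc[idx, ['store_id', 'date', 'station_id', 'weather']])

    # Assumes at least one date had measurements; pd.concat raises on an empty list.
    return pd.concat(parts).sort_values(['store_id', 'date']).reset_index(drop=True)
```

Calling nearest_station_per_date(store_df, weather_station_df) on the toy data from the question should give the same rows as the full join, while dates without any measurements are simply skipped.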

Join pandas dataframes based on distance minimization
