在pandas数据帧中获取彼此的最近点

时间:2017-11-28 15:05:02

标签: python pandas numpy dataframe

我有一个数据框:

  routeId  latitude_value  longitude_value
  r1       28.210216        22.813209
  r2       28.216103        22.496735
  r3       28.161786        22.842318
  r4       28.093110        22.807081
  r5       28.220370        22.503500
  r6       28.220370        22.503500
  r7       28.220370        22.503500

从此我想生成一个像这样的数据框 df2

routeId    nearest
  r1         r3         (for example)
  r2       ...    similarly for all the routes.

我想要实现的逻辑是

对于每条路线,我应该找到所有其他路线的欧氏距离。 并在routeId上迭代它。

有一个计算欧氏距离的功能。

dist = math.hypot(x2 - x1, y2 - y1)

但我很困惑如何构建一个我将传递数据帧的函数,或者使用.apply()

def  get_nearest_route():
    .....
    return df2

4 个答案:

答案 0 :(得分:6)

我建议使用pdist function from scipy.spatial.distance

matrix = scipy.spatial.distance.pdist(df[['latitude_value', 'longitude_value']], metric='euclidean')

将返回形状(n,)的压缩距离矩阵,并计算所有成对距离。

然后你可以使用squareform来获得方形的成对距离矩阵:

matrix = scipy.spatial.distance.squareform(matrix)

然后,对于每一行matrix[i],您可以找到索引处的最大值,例如matrix[i][j]你知道,对于第i点,它的最近点是第j点。

答案 1 :(得分:5)

我们可以使用scipy.spatial.distance.cdist或多个for循环然后用路径替换min并找到最接近的,即

mat = scipy.spatial.distance.cdist(df[['latitude_value','longitude_value']], 
                              df[['latitude_value','longitude_value']], metric='euclidean')

# If you dont want scipy, you can use plain python like 
# import math
# mat = []
# for i,j in zip(df['latitude_value'],df['longitude_value']):
#     k = []
#     for l,m in zip(df['latitude_value'],df['longitude_value']):
#         k.append(math.hypot(i - l, j - m))
#     mat.append(k)
# mat = np.array(mat)

new_df = pd.DataFrame(mat, index=df['routeId'], columns=df['routeId']) 

new_df

的输出
routeId        r1        r2        r3        r4        r5        r6        r7
routeId                                                                      
r1       0.000000  0.316529  0.056505  0.117266  0.309875  0.309875  0.309875
r2       0.316529  0.000000  0.349826  0.333829  0.007998  0.007998  0.007998
r3       0.056505  0.349826  0.000000  0.077188  0.343845  0.343845  0.343845
r4       0.117266  0.333829  0.077188  0.000000  0.329176  0.329176  0.329176
r5       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r6       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r7       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000    

#Replace minimum distance with column name and not the minimum with `False`.
# new_df[new_df != 0].min(),0). This gives a mask matching minimum other than zero.  
closest = np.where(new_df.eq(new_df[new_df != 0].min(),0),new_df.columns,False)

# Remove false from the array and get the column names as list . 
df['close'] = [i[i.astype(bool)].tolist() for i in closest]


 routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500          [r2]
5      r6       28.220370        22.503500          [r2]
6      r7       28.220370        22.503500          [r2] 

如果您不想忽略零,那么

# Store the array values in a variable
arr = new_df.values
# We dont want to find mimimum to be same point, so replace diagonal by nan
arr[np.diag_indices_from(new_df)] = np.nan

# Replace the non nan min with column name and otherwise with false
new_close = np.where(arr == np.nanmin(arr, axis=1)[:,None],new_df.columns,False)

# Get column names ignoring false. 
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]

   routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500      [r6, r7]
5      r6       28.220370        22.503500      [r5, r7]
6      r7       28.220370        22.503500      [r5, r6]

答案 2 :(得分:2)

无法与scipy.spatial.distance.pdist进行比较,但仍可正常使用

from itertools import product
import pandas as pd
import math
df['New']=list(zip(df['latitude_value'],df['longitude_value']))
DF=pd.DataFrame(list(product(df.routeId, df.routeId)), columns=['l1', 'l2'])
New=df[['routeId','New']].merge(DF,left_on='routeId',right_on='l1',how='left').merge(df[['routeId','New']],left_on='l2',right_on='routeId')
New['Cal']=New.apply(lambda x : math.hypot(x.New_x[0] - x.New_y[0] , x.New_x[1]  - x.New_y[1] ),axis=1)
New=New.loc[New.l1!=New.l2,:]
New.sort_values('Cal').drop_duplicates(['l1'],keep='first')


Out[386]: 
   routeId_x                   New_x  l1  l2 routeId_y                   New_y       Cal
47        r6     (28.22037, 22.5035)  r6  r7        r7     (28.22037, 22.5035)  0.000000
41        r7     (28.22037, 22.5035)  r7  r6        r6     (28.22037, 22.5035)  0.000000
39        r5     (28.22037, 22.5035)  r5  r6        r6     (28.22037, 22.5035)  0.000000
43        r2  (28.216103, 22.496735)  r2  r7        r7     (28.22037, 22.5035)  0.007998
2         r3  (28.161786, 22.842318)  r3  r1        r1  (28.210216, 22.813209)  0.056505
14        r1  (28.210216, 22.813209)  r1  r3        r3  (28.161786, 22.842318)  0.056505
17        r4   (28.09311, 22.807081)  r4  r3        r3  (28.161786, 22.842318)  0.077188

答案 3 :(得分:0)

您可以使用itertuples来迭代DataFrame

data = {index: ((df['latitude_value'] - lat)**2 + (df['longitude_value'] - long)**2).drop(index).argmin() for index, lat, long in df.itertuples()}
pd.Series(data)

大型数据集可能需要很长时间