Question

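I have two dataframes, both of which contain latitude and longitude columns. For each lat/lon entry in the first dataframe, I want to evaluate every lat/lon pair in the second dataframe to determine the distance.

For example: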

df1:                     df2:

     lat     lon              lat     lon 
0   38.32  -100.50       0   37.65   -97.87
1   42.51   -97.39       1   33.31   -96.40
2   33.45  -103.21       2   36.22  -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...

I'm not sure how to go about this.

Thanks for your help!

Answer 1

Euclidean distance is computed as

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

You can do this with two dataframes like so:

((df1 - df2) ** 2).sum(1) ** .5

0    2.714001
1    9.253113
2    4.232363
dtype: float64
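
Note that the expression above aligns the two frames on their index, so it only yields the distances for rows 0-0, 1-1 and 2-2, and the values are in degrees rather than a physical length. A minimal sketch of the same Euclidean computation for every df1/df2 pair, using NumPy broadcasting (the frames are rebuilt from the question's sample data, and `euclid` is just an illustrative name):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45], "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22], "lon": [-97.87, -96.40, -100.01]})

a = df1[["lat", "lon"]].to_numpy()   # shape (3, 2)
b = df2[["lat", "lon"]].to_numpy()   # shape (3, 2)

# Broadcasting produces one difference vector per (df1 row, df2 row) pair: shape (3, 3, 2).
diff = a[:, None, :] - b[None, :, :]

# euclid[i, j] is the Euclidean distance from df1 row i to df2 row j.
euclid = np.sqrt((diff ** 2).sum(axis=-1))

# The diagonal holds the index-aligned distances computed by the one-liner above.
print(np.diag(euclid))
```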

Answer 2

UPDATE: as @root pointed out, the Euclidean metric doesn't make much sense in this case, so let's use sklearn.neighbors.DistanceMetric

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

First we can build a DF with all combinations - (c) root:

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)

Vectorized haversine distance calculation:

x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594
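
On pandas 1.2 or later the helper column `k` should not be needed, since `merge` accepts `how='cross'`. A small sketch, assuming the question's sample data (`pairs` is an illustrative name, not part of the original answer):

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45], "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22], "lon": [-97.87, -96.40, -100.01]})

# Cross join: every row of df1 is paired with every row of df2.
# Overlapping column names get the suffixes, giving lat1, lon1, lat2, lon2.
pairs = df1.merge(df2, how="cross", suffixes=("1", "2"))

print(pairs)
```

The rows come out with each df1 row repeated against all df2 rows in turn, which is the same row-major order that `np.ravel` gives for the pairwise distance matrix.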

OLD answer:

IIUC you can use pairwise distances via scipy.spatial.distance.pdist:

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a Pandas.Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a Pandas.DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
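
Because `pdist` runs on the concatenated frame, the output above also contains the df1-df1 and df2-df2 pairs, and the repeated index labels make them hard to tell apart. If only the df1-to-df2 distances are wanted, `scipy.spatial.distance.cdist` computes just that 3x3 block. A minimal sketch, assuming the question's sample data; like `pdist` it defaults to the Euclidean metric, so the values are still in degrees (`out` is an illustrative name):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45], "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22], "lon": [-97.87, -96.40, -100.01]})

# d[i, j] is the Euclidean distance between df1 row i and df2 row j.
d = cdist(df1.to_numpy(), df2.to_numpy())

# Flatten to long form, labelling each distance with the row it came from in each frame.
out = pd.DataFrame({
    "df1": np.repeat(df1.index.to_numpy(), len(df2)),
    "df2": np.tile(df2.index.to_numpy(), len(df1)),
    "dist": d.ravel(),
})

print(out)
```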

Answer 3

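You can perform a cross join to get all combinations of lat/lon, then compute the distance using an appropriate metric. For this you can use the geopy package, which provides geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances; vincenty gives more accurate results but is slower to compute.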

from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I used kilometers as my unit in the example, but other units can be specified, e.g.:

vincenty(latlon1, latlon2).miles

The resulting output:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

As @MaxU pointed out in the comments, you can use a numpy implementation of the Haversine formula in a similar way for extra performance. This should be equivalent to the great_circle function in geopy.
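
For illustration, here is one way such a NumPy haversine could look. This is a sketch rather than the implementation linked above: the function name `haversine_pairwise` is made up, and the Earth radius of 6367 km is an assumption borrowed from the sklearn answer so that the numbers are comparable.

```python
import numpy as np

def haversine_pairwise(lat1, lon1, lat2, lon2, radius_km=6367.0):
    """Great-circle distance in km between every point of set 1 and every point of set 2."""
    # Convert degrees to radians; set 1 runs along rows, set 2 along columns.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula, evaluated for all pairs at once via broadcasting.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sample coordinates copied from the question (df1 first, then df2).
d = haversine_pairwise(
    [38.32, 42.51, 33.45], [-100.50, -97.39, -103.21],
    [37.65, 33.31, 36.22], [-97.87, -96.40, -100.01],
)
print(d)  # d[i, j] is the distance in km from df1 row i to df2 row j
```

With a cross-joined frame such as `dist` above, `d.ravel()` lines up with the rows, because each df1 row is paired with all df2 rows in order.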
