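I have two DataFrames, both containing latitude and longitude columns. For each lat/lon entry in the first DataFrame, I want to evaluate every lat/lon pair in the second DataFrame to determine the distance.
For example: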
df1:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21

df2:
     lat     lon
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...
I'm not sure how to go about this.
Thanks for your help!
Answer 0 (score: 3)
You can do this directly with the two DataFrames, like this:
((df1 - df2) ** 2).sum(1) ** .5
0 2.714001
1 9.253113
2 4.232363
dtype: float64
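Note that the expression above subtracts the two frames row by row (aligned on the index), so it returns one straight-line distance per matching row, measured in degrees, rather than all nine combinations asked for in the question. A minimal pure-Python sketch of the same arithmetic, using the sample values from the question (the names `df1_rows`, `df2_rows`, and `row_wise` are made up for illustration):

```python
import math

# Sample coordinates from the question, as (lat, lon) tuples.
df1_rows = [(38.32, -100.50), (42.51, -97.39), (33.45, -103.21)]
df2_rows = [(37.65, -97.87), (33.31, -96.40), (36.22, -100.01)]

# Row-aligned Euclidean distance in degrees:
# row i of df1 is compared with row i of df2 only.
row_wise = [math.hypot(a[0] - b[0], a[1] - b[1])
            for a, b in zip(df1_rows, df2_rows)]

for i, d in enumerate(row_wise):
    print(i, round(d, 6))
```

Only three values come out, one per index, which is why the other answers build all combinations first and use a spherical metric.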
Answer 1 (score: 3)
UPDATE: as @root pointed out, using the Euclidean metric doesn't make much sense in this case, so let's use sklearn.neighbors.DistanceMetric:
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
First we can build a DataFrame with all combinations - (c) root:
x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)  # axis must be passed by keyword in pandas 2.0+
Vectorized haversine distance calculation (6367 is the Earth radius in km used here):
x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)
Result:
In [86]: x
Out[86]:
lat1 lon1 lat2 lon2 dist
0 38.32 -100.50 37.65 -97.87 242.073182
1 38.32 -100.50 33.31 -96.40 667.993048
2 38.32 -100.50 36.22 -100.01 237.350451
3 42.51 -97.39 37.65 -97.87 541.605087
4 42.51 -97.39 33.31 -96.40 1026.006744
5 42.51 -97.39 36.22 -100.01 734.219411
6 33.45 -103.21 37.65 -97.87 671.274044
7 33.45 -103.21 33.31 -96.40 632.004981
8 33.45 -103.21 36.22 -100.01 424.140594
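To make the haversine step less of a black box, here is a minimal standard-library sketch of the same calculation. It assumes the standard haversine formula and the 6367 km radius used above; `haversine_km`, `points1`, `points2`, and `all_pairs` are illustrative names, not part of sklearn.

```python
import math
from itertools import product

EARTH_RADIUS_KM = 6367  # same radius as used in the answer above

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Sample coordinates from the question, as (lat, lon) tuples.
points1 = [(38.32, -100.50), (42.51, -97.39), (33.45, -103.21)]
points2 = [(37.65, -97.87), (33.31, -96.40), (36.22, -100.01)]

# product() varies points2 fastest, which matches the row order of the cross join.
all_pairs = [(p, q, haversine_km(*p, *q)) for p, q in product(points1, points2)]

for p, q, d in all_pairs:
    print(p, q, round(d, 3))
```

This loops in pure Python, so it is only meant for checking small inputs; the vectorized call above is the better choice for large frames.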
OLD answer:
IIUC, you can use pairwise distances via scipy.spatial.distance.pdist:
In [32]: from scipy.spatial.distance import pdist
In [43]: from itertools import combinations
In [34]: X = pd.concat([df1, df2])
In [35]: X
Out[35]:
lat lon
0 38.32 -100.50
1 42.51 -97.39
2 33.45 -103.21
0 37.65 -97.87
1 33.31 -96.40
2 36.22 -100.01
As a pandas.Series:
In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))
In [37]: s
Out[37]:
0 1 5.218065
2 5.573240
0 2.714001
1 6.473801
2 2.156409
1 2 10.768287
0 4.883646
1 9.253113
2 6.813846
2 0 6.793791
1 6.811439
2 4.232363
0 1 4.582194
2 2.573810
1 2 4.636831
dtype: float64
As a pandas.DataFrame:
In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
df1 df2 dist
0 0 1 5.218065
1 0 2 5.573240
2 0 0 2.714001
3 0 1 6.473801
4 0 2 2.156409
5 1 2 10.768287
6 1 0 4.883646
7 1 1 9.253113
8 1 2 6.813846
9 2 0 6.793791
10 2 1 6.811439
11 2 2 4.232363
12 0 1 4.582194
13 0 2 2.573810
14 1 2 4.636831
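One caveat about the output above: because `pdist` runs on the concatenated frame, it returns all 15 pairs among the six points, including pairs that lie within the same original frame, so the `df1`/`df2` column labels do not always refer to different frames. If only the nine cross-frame pairs are wanted, `scipy.spatial.distance.cdist` compares two separate point sets directly. A minimal sketch, still using the Euclidean metric in degrees as in this old answer (the array names `a`, `b`, and `D` are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Sample coordinates from the question, as (lat, lon) rows.
a = np.array([[38.32, -100.50], [42.51, -97.39], [33.45, -103.21]])
b = np.array([[37.65, -97.87], [33.31, -96.40], [36.22, -100.01]])

# cdist compares every row of a with every row of b,
# so the result has shape (len(a), len(b)).
D = cdist(a, b)  # Euclidean by default, so values are in degrees

print(D.round(6))
```

`D[i, j]` is the distance between row `i` of the first set and row `j` of the second.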
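Answer 2 (score: 3)
You can perform a cross join to get all combinations of lat/lon, then compute the distance using an appropriate metric. For this you can use the geopy package, which provides geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances; vincenty gives more accurate results but is computationally slower.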
import pandas as pd
# Note: newer geopy releases (2.0+) dropped vincenty;
# geopy.distance.geodesic is the replacement there.
from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])
    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)
I used kilometers as the unit in my example, but other units can be specified, e.g.:
vincenty(latlon1, latlon2).miles
Resulting output:
lat1 lon1 lat2 lon2 distance
0 38.32 -100.50 37.65 -97.87 242.709065
1 38.32 -100.50 33.31 -96.40 667.878723
2 38.32 -100.50 36.22 -100.01 237.080141
3 42.51 -97.39 37.65 -97.87 541.184297
4 42.51 -97.39 33.31 -96.40 1024.839512
5 42.51 -97.39 36.22 -100.01 733.819732
6 33.45 -103.21 37.65 -97.87 671.766908
7 33.45 -103.21 33.31 -96.40 633.751134
8 33.45 -103.21 36.22 -100.01 424.335874
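The temporary key column `k` in the cross join above is a workaround. In pandas 1.2 and later, `merge` accepts `how='cross'`, which produces the same nine combinations without the helper column. A minimal sketch on the sample data from the question (the name `pairs` is illustrative):

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# how='cross' (pandas >= 1.2) pairs every row of df1 with every row of df2,
# so no temporary key column has to be added and dropped.
pairs = df1.merge(df2, how='cross', suffixes=('1', '2'))

print(pairs)
```

The resulting columns are `lat1`, `lon1`, `lat2`, `lon2`, so the `get_lat_lon_dist` function above can be applied to `pairs` unchanged.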
Edit:
As @MaxU pointed out in the comments, you can use a numpy implementation of the Haversine formula in a similar manner for extra performance. This should be equivalent to the great_circle function in geopy.
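The linked implementation is not reproduced here, so the following is only a sketch of what a NumPy haversine might look like, assuming the standard formula and using broadcasting to get all pairs at once. The function name `haversine_matrix` and the default radius of 6371 km (a common mean Earth radius) are assumptions, not taken from the comment or from geopy.

```python
import numpy as np

def haversine_matrix(latlon_a, latlon_b, radius_km=6371.0):
    """All-pairs great-circle distances between two sets of (lat, lon) degrees.

    Returns an array of shape (len(latlon_a), len(latlon_b)).
    """
    a = np.radians(np.asarray(latlon_a, dtype=float))
    b = np.radians(np.asarray(latlon_b, dtype=float))
    lat_a = a[:, 0][:, None]  # shape (n, 1)
    lon_a = a[:, 1][:, None]
    lat_b = b[:, 0][None, :]  # shape (1, m)
    lon_b = b[:, 1][None, :]
    # Broadcasting (n, 1) against (1, m) yields every combination at once.
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))

# Sample coordinates from the question.
pts1 = np.array([[38.32, -100.50], [42.51, -97.39], [33.45, -103.21]])
pts2 = np.array([[37.65, -97.87], [33.31, -96.40], [36.22, -100.01]])

D = haversine_matrix(pts1, pts2)
print(D.round(3))
```

Flattening with `D.ravel()` gives the distances in the same row order as the cross join, so the result can be assigned straight to a column of the combined frame.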