Comparing columns in two separate pandas DataFrames

Time: 2017-04-03 18:37:44

Tags: python pandas

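I have two DataFrames, both containing latitude and longitude columns. For each lat/lon entry in the first DataFrame, I want to evaluate every lat/lon pair in the second DataFrame to determine the distance between them.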

For example:

df1:                     df2:

     lat     lon              lat     lon 
0   38.32  -100.50       0   37.65   -97.87
1   42.51   -97.39       1   33.31   -96.40
2   33.45  -103.21       2   36.22  -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...

I'm not sure how to go about this.

Thanks for your help!

3 answers:

Answer 0 (score: 3)

Euclidean distance is computed as

d(p, q) = sqrt((p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2)

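You can do this with the two DataFrames like so (rows are paired by index, so this yields one distance per matching row rather than every combination):

((df1 - df2) ** 2).sum(axis=1) ** .5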

0    2.714001
1    9.253113
2    4.232363
dtype: float64
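The expression above only compares rows that share an index. To cover every df1/df2 combination with the same Euclidean formula, one option is NumPy broadcasting. This is a minimal sketch that rebuilds the question's sample data; the names `pairwise` and `pairwise_df` are illustrative, and the values are distances in degree space, not geographic distances.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# Reshape so that subtraction broadcasts to every (df1 row, df2 row) pair.
a = df1[['lat', 'lon']].to_numpy()[:, None, :]   # shape (3, 1, 2)
b = df2[['lat', 'lon']].to_numpy()[None, :, :]   # shape (1, 3, 2)

# Euclidean distance over the last axis gives a 3 x 3 matrix.
pairwise = np.sqrt(((a - b) ** 2).sum(axis=2))

# Rows are labelled by df1's index, columns by df2's index.
pairwise_df = pd.DataFrame(pairwise, index=df1.index, columns=df2.index)
```

The diagonal of `pairwise_df` holds the three row-aligned distances shown above; the off-diagonal cells hold the remaining combinations.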

Answer 1 (score: 3)

UPDATE: as @root pointed out, using the Euclidean metric doesn't make much sense in this case, so let's use sklearn.neighbors.DistanceMetric

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

First we can build a DataFrame with all combinations - (c) root

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)
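On pandas 1.2 or later, the same table of combinations can be built without the temporary key column by passing `how='cross'` to `merge`. A minimal sketch, assuming the question's sample data:

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# how='cross' (pandas >= 1.2) pairs every row of df1 with every row of df2;
# the suffixes rename the overlapping lat/lon columns.
x = df1.merge(df2, how='cross', suffixes=('1', '2'))
```

The result has len(df1) * len(df2) rows and the columns lat1, lon1, lat2, lon2.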

Vectorized haversine distance calculation

x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594

OLD answer:

IIUC, you can use the pairwise distance function scipy.spatial.distance.pdist

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a pandas.Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a pandas.DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
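Because `pdist` runs over the concatenated frame, the 15 rows above also include pairs taken from within the same DataFrame. If only the df1-to-df2 pairs are wanted, `scipy.spatial.distance.cdist` is one way to get them. This is a sketch under the same Euclidean assumption as the old answer; the name `out` is illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# cdist compares every row of the first array with every row of the second
# (Euclidean metric by default), giving a len(df1) x len(df2) matrix.
d = cdist(df1.to_numpy(), df2.to_numpy())

# Flatten to long format: one row per (df1 index, df2 index) pair.
out = pd.DataFrame({
    'df1': np.repeat(df1.index.to_numpy(), len(df2)),
    'df2': np.tile(df2.index.to_numpy(), len(df1)),
    'dist': d.ravel(),
})
```

This yields 9 rows instead of 15, and the `df1` and `df2` columns then refer unambiguously to rows of the respective frames.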

Answer 2 (score: 3)

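You can perform a cross join to get all combinations of lat/lon, then compute the distance with an appropriate measure. For this you can use the geopy package, which provides geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances; vincenty gives more accurate results but is slower to compute. (vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement there.)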

from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I used kilometers as the unit in my example, but other units can be specified, for example:

vincenty(latlon1, latlon2).miles

Resulting output:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

As @MaxU pointed out in the comments, you can use a numpy implementation of the Haversine formula in a similar manner for extra performance. This should be equivalent to the great_circle function in geopy.
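A vectorized haversine calculation along those lines might look like the sketch below. The function name `haversine_np` and the `radius_km` parameter are illustrative, not taken from the linked implementation; the default radius of 6367 km is the value used in the sklearn-based answer above.

```python
import numpy as np
import pandas as pd

def haversine_np(lat1, lon1, lat2, lon2, radius_km=6367.0):
    """Great-circle distance in km between points given in decimal degrees.

    Works element-wise on scalars, NumPy arrays, or pandas Series.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sample data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# Cross join via a temporary key, as in the answers above.
pairs = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k',
                 suffixes=('1', '2')).drop(columns='k')

# One vectorized call over whole columns instead of a row-wise apply.
pairs['dist_km'] = haversine_np(pairs['lat1'], pairs['lon1'],
                                pairs['lat2'], pairs['lon2'])
```

Since the function operates on entire columns at once, it avoids the per-row Python call that `DataFrame.apply` makes, which is where the performance gain comes from.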