Question

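I have 2 sets of geo-codes as pandas Series, and I am trying to find the fastest way to get the minimum Euclidean distance of points in set A from points in set B. That is: the closest point in the second set to (40.748043, -73.992953), and so on. I would really appreciate any suggestions or help.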

Set A:
    print(latitude1)
    print(longitude1)

    0    40.748043
    1    42.361016

    Name: latitude, dtype: float64
    0    -73.992953
    1    -71.020005
    Name: longitude, dtype: float64

Set B:
    print(latitude2)
    print(longitude2)

    0    42.50729
    1    42.50779
    2    25.56473
    3    25.78953
    4    25.33132
    5    25.06570
    6    25.59246
    7    25.61955
    8    25.33737
    9    24.11028
    Name: latitude, dtype: float64
    0     1.53414
    1     1.52109
    2    55.55517
    3    55.94320
    4    56.34199
    5    55.17128
    6    56.26176
    7    56.27291
    8    55.41206
    9    52.73056
    Name: longitude, dtype: float64

Answer 1

Here is one way using only numpy.linalg.norm.

import pandas as pd, numpy as np

df1['coords1'] = list(zip(df1['latitude1'], df1['longitude1']))
df2['coords2'] = list(zip(df2['latitude2'], df2['longitude2']))

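def calc_min(x):
    # Euclidean distance from point x to every point in df2, then the position of the smallest
    amin = np.argmin([np.linalg.norm(np.array(x)-np.array(y)) for y in df2['coords2']])
    # Return the (latitude, longitude) tuple of the closest point in df2
    return df2['coords2'].iloc[amin]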

df1['closest'] = df1['coords1'].map(calc_min)

#    latitude1  longitude1                  coords1              closest
# 0  40.748043  -73.992953  (40.748043, -73.992953)  (42.50779, 1.52109)
# 1  42.361016  -71.020005  (42.361016, -71.020005)  (42.50779, 1.52109)
# 2  25.361016   54.000000        (25.361016, 54.0)  (25.0657, 55.17128)

Setup

from io import StringIO

mystr1 = """latitude1|longitude1
40.748043|-73.992953
42.361016|-71.020005
25.361016|54.0000
"""

mystr2 = """latitude2|longitude2
42.50729|1.53414
42.50779|1.52109
25.56473|55.55517
25.78953|55.94320
25.33132|56.34199
25.06570|55.17128
25.59246|56.26176
25.61955|56.27291
25.33737|55.41206
24.11028|52.73056"""

df1 = pd.read_csv(StringIO(mystr1), sep='|')
df2 = pd.read_csv(StringIO(mystr2), sep='|')

If performance is an issue, you can easily vectorise this calculation by working on the underlying numpy arrays.
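As a minimal sketch of that vectorised variant (not part of the original answer), the pairwise distances can be computed in one step with NumPy broadcasting. The DataFrames are rebuilt inline from the sample data so the snippet runs on its own; the names `a`, `b`, `dist` and `idx` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question and the setup above.
df1 = pd.DataFrame({'latitude1': [40.748043, 42.361016, 25.361016],
                    'longitude1': [-73.992953, -71.020005, 54.0]})
df2 = pd.DataFrame({'latitude2': [42.50729, 42.50779, 25.56473, 25.78953, 25.33132,
                                  25.06570, 25.59246, 25.61955, 25.33737, 24.11028],
                    'longitude2': [1.53414, 1.52109, 55.55517, 55.94320, 56.34199,
                                   55.17128, 56.26176, 56.27291, 55.41206, 52.73056]})

a = df1[['latitude1', 'longitude1']].to_numpy()  # shape (n_a, 2)
b = df2[['latitude2', 'longitude2']].to_numpy()  # shape (n_b, 2)

# Broadcasting: (n_a, 1, 2) - (1, n_b, 2) gives all pairwise differences, shape (n_a, n_b, 2)
diff = a[:, None, :] - b[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))          # pairwise Euclidean distances, shape (n_a, n_b)

idx = dist.argmin(axis=1)                        # position in df2 of the closest point per row of df1
df1['closest'] = list(map(tuple, b[idx]))
print(df1)
```

This builds an n_a by n_b distance matrix, so memory grows with the product of the two set sizes; for very large sets a tree-based lookup is likely the better fit.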

Answer 2

You can try the geopy library.

https://pypi.python.org/pypi/geopy

Here is an example from the documentation.

>>> from geopy.distance import vincenty
>>> newport_ri = (41.49008, -71.312796)
>>> cleveland_oh = (41.499498, -81.695391)
>>> print(vincenty(newport_ri, cleveland_oh).miles)
538.3904451566326

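Here vincenty is the Vincenty distance. Note that geopy 2.0 and later removed vincenty; geopy.distance.geodesic is its replacement and is called the same way.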

https://en.wikipedia.org/wiki/Vincenty%27s_formulae
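To apply a geographic distance to the question's two sets without looping over pairs, here is a rough sketch that swaps in the haversine (great-circle) formula written in plain NumPy. It is not geopy's Vincenty or geodesic calculation and is less accurate than either, since it treats the Earth as a sphere. The function name `haversine_km` and the radius constant are assumptions made for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are in degrees and may be broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# Set A and Set B from the question.
a_lat = np.array([40.748043, 42.361016])
a_lon = np.array([-73.992953, -71.020005])
b_lat = np.array([42.50729, 42.50779, 25.56473, 25.78953, 25.33132,
                  25.06570, 25.59246, 25.61955, 25.33737, 24.11028])
b_lon = np.array([1.53414, 1.52109, 55.55517, 55.94320, 56.34199,
                  55.17128, 56.26176, 56.27291, 55.41206, 52.73056])

# Column vector against row vector broadcasts to an (n_a, n_b) distance matrix.
dist_km = haversine_km(a_lat[:, None], a_lon[:, None], b_lat[None, :], b_lon[None, :])
nearest = dist_km.argmin(axis=1)  # index into Set B of the nearest point for each point in Set A
print(nearest)
```

Euclidean distance on raw latitude/longitude values and great-circle distance can rank candidates differently, especially far from the equator or across large longitude spans, so which one to use depends on what "closest" should mean.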

Answer 3

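For nearest-point computations like this, an approach that is usually efficient is one of the kd-tree based fast nearest-neighbour lookups. Using the Cython-powered implementation, scipy.spatial.cKDTree, one approach would be: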

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def closest_pts(setA_lat, setA_lng, setB_lat, setB_lng):
    a_x = setA_lat.values
    a_y = setA_lng.values
    b_x = setB_lat.values
    b_y = setB_lng.values

    a = np.c_[a_x, a_y]
    b = np.c_[b_x, b_y]
    indx = cKDTree(b).query(a,k=1)[1]
    return pd.Series(b_x[indx]), pd.Series(b_y[indx])

Sample run:

1) Inputs:

In [106]: setA_lat
Out[106]: 
0    40.748043
1    42.361016
dtype: float64

In [107]: setA_lng
Out[107]: 
0   -73.992953
1   -71.020005
dtype: float64

In [108]: setB_lat
Out[108]: 
0    42.460000
1     0.645894
2     0.437587
3    40.460000
4     0.963663
dtype: float64

In [109]: setB_lng
Out[109]: 
0   -71.000000
1     0.925597
2     0.071036
3   -72.000000
4     0.020218
dtype: float64

2) Outputs:

In [110]: c_x,c_y = closest_pts(setA_lat, setA_lng, setB_lat, setB_lng)

In [111]: c_x
Out[111]: 
0    40.46
1    42.46
dtype: float64

In [112]: c_y
Out[112]: 
0   -72.0
1   -71.0
dtype: float64
