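Speeding up the distance calculation between two sets of latitude/longitude points

Time: 2019-07-24 08:32:23

Tags: python-3.x pandas gis geopy

I have two DataFrames that each contain Lat and Lon columns. For every (Lat, Lon) pair in one DataFrame, I want to compute the distance to ALL (Lat, Lon) pairs in the other DataFrame and take the minimum. I am using the geopy package. The code is as follows: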

from geopy import distance
import numpy as np

distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)

    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []

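The problem is that df1 has 168K rows and df2 has 1200 rows. How can I make this faster?

4 answers:

Answer 0 (score: 1)

Leaving this here in case anyone needs it in the future:

If you only need the minimum distance, you don't have to brute-force every pair. There are data structures that solve this in O(n*log(n)) time, which is much faster than the brute-force approach.

For example, you can use a generalized KNearestNeighbors (k=1) algorithm, as long as you account for the points lying on a sphere rather than a plane. See this SO answer for an example implementation using sklearn.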
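As a rough illustration of the nearest-neighbour idea, here is a minimal sketch using scikit-learn's `BallTree` with the `haversine` metric. It is not the code from the linked answer. The helper name `nearest_points`, the Earth-radius constant, and the sample coordinates are assumptions made for the example; the haversine metric treats the Earth as a sphere, so results will differ slightly from geopy's geodesic distances.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def nearest_points(query_deg, candidates_deg):
    """For each (lat, lon) row in query_deg, find the closest row of candidates_deg.

    Both inputs are arrays of shape (n, 2) in degrees, ordered (lat, lon).
    Returns (indices into candidates_deg, distances in miles).
    """
    # The haversine metric expects (lat, lon) in radians.
    tree = BallTree(np.radians(candidates_deg), metric="haversine")
    dist_rad, idx = tree.query(np.radians(query_deg), k=1)
    # query() returns arrays of shape (n_queries, 1); distances are angles
    # on the unit sphere, so scale them by the Earth's radius.
    return idx[:, 0], dist_rad[:, 0] * EARTH_RADIUS_MILES


# Hypothetical sample data standing in for df2[["LAT", "LON"]] and df1[["LAT", "LON"]].
candidates = np.array([[41.49, -71.31], [41.50, -81.70], [34.05, -118.24]])
queries = np.array([[40.71, -74.01], [36.17, -115.14]])
indices, miles = nearest_points(queries, candidates)
```

With the DataFrames from the question, the tree would be built once from the 1200 rows of df2 and queried with all rows of df1 in a single call.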

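There also seem to be libraries that address this problem, such as sknni and GriSPy.

Another question here also covers the theory.

Answer 1 (score: 0)

If you use itertools instead of explicit for loops, this should run faster. The inline comments should help you understand what happens at each step.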

import numpy as np
import pandas as pd
import itertools
from geopy import distance


#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})


#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))


#Product function in itertools does a cross product between the 2 iterables
#You should get things of the form ( ( lat, lon), (lat, lon) ) where 1st is target, second is point. Feel free to change the order if needed
product = list(itertools.product(target, point))

#starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
50
geo_dist = [42.430772028845716,
 44.29982320107605,
 25.88823239877388,
 23.877570442142783,
 29.9351451072828,
 ...]
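The flat list above holds one distance per (target, point) pair but does not yet say which point is closest to each target. Because `itertools.product` varies its second argument fastest, the list can be reshaped into one row per target. The sketch below shows that step; `toy_distance` is a planar stand-in for `distance.distance(a, b).miles` so the snippet runs without geopy, and the coordinates are made up for the example.

```python
import itertools
import math

import numpy as np


def toy_distance(a, b):
    # Stand-in for distance.distance(a, b).miles (an assumption for this sketch).
    return math.dist(a, b)


target = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]  # hypothetical rows of df1
point = [(1.0, 1.0), (6.0, 5.0)]               # hypothetical rows of df2

# product() varies the second iterable fastest, so the result is target-major.
flat = list(itertools.starmap(toy_distance, itertools.product(target, point)))

# One row per target, one column per point.
matrix = np.array(flat).reshape(len(target), len(point))
closest_idx = matrix.argmin(axis=1)  # position in `point` of the nearest entry
closest_dist = matrix.min(axis=1)    # the corresponding minimum distance
```

Note that materialising every pair as a list scales with len(df1) * len(df2), so for the sizes in the question it may be more practical to process the targets in batches.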

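Finally, if you are working with a large dataset, I suggest using the multiprocessing library to map itertools.starmap across different cores and compute the distance values asynchronously. Python's multiprocessing library now supports starmap.

Answer 2 (score: 0)

If you need to check all pairs by brute force, I think the following approach is the best option.
Looping directly over the columns is usually a bit faster than iterrows, and replacing the inner loop with a vectorized approach also saves time.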

from geopy import distance

count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    #    print(count) #printing is also time expensive
    # distance from the current df1 target to every row of df2
    df2['dist'] = df2.apply(lambda row : distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestdist = df2['dist'].min() #if you want the minimum distance
    closestpoint = df2['dist'].idxmin() #if you want the position (index) of the minimum.

Answer 3 (score: 0)

geopy.distance.distance uses the geodesic algorithm by default, which is slower but more accurate. If you can trade accuracy for speed, you can use great_circle, which is about 20 times faster:

In [4]: %%timeit
   ...: distance.distance(newport_ri, cleveland_oh).miles
   ...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit
   ...: distance.great_circle(newport_ri, cleveland_oh).miles
   ...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
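Taking the timings above at face value, the 168K × 1200 ≈ 201.6 million pairs from the question would need roughly 13 hours with the geodesic call and about 45 minutes with great_circle in a single process. If a spherical approximation is acceptable, the same great-circle (haversine) formula can also be evaluated on whole NumPy arrays at once. The sketch below is a swapped-in NumPy implementation, not geopy's API. The function names, the Earth-radius constant, and the chunk size are assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; inputs in degrees, scalars or broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against rounding pushing `a` marginally outside [0, 1].
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def closest_per_target(lat1, lon1, lat2, lon2, chunk=5_000):
    """For every point of the first set, the index of the nearest point in the second set."""
    lat1, lon1 = np.asarray(lat1, dtype=float), np.asarray(lon1, dtype=float)
    lat2, lon2 = np.asarray(lat2, dtype=float), np.asarray(lon2, dtype=float)
    out = np.empty(len(lat1), dtype=np.intp)
    for start in range(0, len(lat1), chunk):
        stop = start + chunk
        # Shape (chunk, 1) against (n2,) broadcasts to a (chunk, n2) distance matrix.
        d = haversine_miles(lat1[start:stop, None], lon1[start:stop, None], lat2, lon2)
        out[start:stop] = d.argmin(axis=1)
    return out
```

With the question's DataFrames this would be called as `closest_per_target(df1["LAT"].to_numpy(), df1["LON"].to_numpy(), df2["LAT"].to_numpy(), df2["LON"].to_numpy())`. Processing in chunks keeps the intermediate distance matrix from growing to the full 168K × 1200 size.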

You can also use multiprocessing to parallelize the computation:

from multiprocessing import Pool
from geopy import distance
import numpy as np


def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles


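# The guard is needed on platforms that start worker processes by spawning (e.g. Windows).
if __name__ == "__main__":
    # Build the df2 points once instead of iterating over df2 for every target.
    points = list(zip(df2["LAT"], df2["LON"]))
    with Pool() as pool:
        for id1, row1 in df1.iterrows():
            target = (row1["LAT"], row1["LON"])
            distanceMiles = pool.map(
                compute,
                [(target, point) for point in points]
            )
            closestPoint = np.argmin(distanceMiles)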