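I have two DataFrames that each contain Lat and Lon columns. For each (Lat, Lon) pair in one DataFrame, I want to compute the distance to ALL (Lat, Lon) pairs in the other DataFrame and take the minimum. I am using the geopy package. The code is as follows: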
from geopy import distance
import numpy as np
distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)
    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []
The problem is that df1 has 168K rows and df2 has 1200 rows. How can I make this faster?
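Answer 0: (score: 1)
Leaving this here in case someone needs it in the future:
If you only need the minimum distance, you do not have to brute-force every pair. There are data structures that let you solve this in O(n*log(n)) time, which is much faster than the brute-force approach.
For example, you can use a generalized KNearestNeighbors (k=1) algorithm, provided you account for the points lying on a sphere rather than on a plane. See this SO answer for an example implementation using sklearn.
There also appear to be libraries that address this problem, such as sknni and GriSPy.
Another question here also covers the theory.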
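A minimal sketch of the nearest-neighbour idea, assuming scikit-learn is installed and the columns are named LAT and LON as in the question. It uses `sklearn.neighbors.BallTree` with the haversine metric, which expects [lat, lon] in radians and returns distances in radians on the unit sphere. The helper name `nearest_in_df2`, the Earth radius constant, and the sample coordinates are illustrative assumptions, and haversine distances differ slightly from geopy's geodesic results.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def nearest_in_df2(df1, df2):
    """For each row of df1, return the position of the nearest df2 row
    and the haversine distance to it in miles (hypothetical helper)."""
    # The haversine metric expects [lat, lon] pairs in radians.
    pts1 = np.radians(df1[["LAT", "LON"]].to_numpy())
    pts2 = np.radians(df2[["LAT", "LON"]].to_numpy())
    tree = BallTree(pts2, metric="haversine")  # build the tree once on the smaller set
    dist_rad, idx = tree.query(pts1, k=1)      # one nearest neighbour per df1 row
    return idx[:, 0], dist_rad[:, 0] * EARTH_RADIUS_MILES


# Small made-up sample, only to show the call
df1 = pd.DataFrame({"LAT": [41.49, 40.71], "LON": [-71.31, -74.01]})
df2 = pd.DataFrame({"LAT": [41.50, 34.05, 40.70], "LON": [-81.70, -118.24, -74.00]})
nearest_idx, nearest_miles = nearest_in_df2(df1, df2)
```

The tree is built once on df2 (1200 rows) and then queried for all of df1 in a single call, so no Python-level loop over the 168K rows is needed.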
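Answer 1: (score: 0)
If you use itertools instead of explicit for loops, this should run faster. The inline comments should help you understand what happens at each step.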
import numpy as np
import pandas as pd
import itertools
from geopy import distance
#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})
#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))
#The product function in itertools computes the cross product of the 2 iterables
#You should get things of the form ( (lat, lon), (lat, lon) ) where 1st is target, second is point. Feel free to change the order if needed
product = list(itertools.product(target, point))
#starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
50
geo_dist = [42.430772028845716,
44.29982320107605,
25.88823239877388,
23.877570442142783,
29.9351451072828,
...]
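Finally, if you are working with large datasets, I suggest using the multiprocessing library to distribute the itertools.starmap work across cores and compute the distances asynchronously. The Python multiprocessing library now supports starmap.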
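Answer 2: (score: 0)
If you need to check every pair by brute force, I think the following approach is the best option.
Looping directly over the columns is usually a bit faster than iterrows, and replacing the inner loop with a vectorized approach also saves time.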
count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    # print(count)  # printing is also time expensive
    # Distance from the current df1 point to every row of df2
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestpoint = df2['dist'].min()  # if you want the minimum distance
    closestpoint = df2['dist'].idxmin()  # if you want the position (index) of the minimum
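A fully vectorized inner loop can be sketched with NumPy alone if an approximate spherical distance is acceptable. The sketch below swaps geopy's distance for the haversine formula and processes the first set in chunks so the intermediate distance matrix stays small. The function name `nearest_by_haversine`, the `chunk_size` parameter, the Earth radius constant, and the sample coordinates are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def nearest_by_haversine(lat1, lon1, lat2, lon2, chunk_size=1000):
    """For each point in set 1, return the index of the nearest point in
    set 2 and the haversine distance to it in miles (hypothetical helper)."""
    lat1 = np.radians(np.asarray(lat1, dtype=float))
    lon1 = np.radians(np.asarray(lon1, dtype=float))
    lat2 = np.radians(np.asarray(lat2, dtype=float))
    lon2 = np.radians(np.asarray(lon2, dtype=float))
    nearest_idx = np.empty(lat1.size, dtype=np.intp)
    nearest_miles = np.empty(lat1.size)
    for start in range(0, lat1.size, chunk_size):
        stop = start + chunk_size
        la1 = lat1[start:stop, None]  # column vector, broadcasts against set 2
        lo1 = lon1[start:stop, None]
        # Haversine formula evaluated for a (chunk, len(set 2)) block at once
        a = (np.sin((lat2 - la1) / 2) ** 2
             + np.cos(la1) * np.cos(lat2) * np.sin((lon2 - lo1) / 2) ** 2)
        d = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        idx = d.argmin(axis=1)
        nearest_idx[start:stop] = idx
        nearest_miles[start:stop] = d[np.arange(idx.size), idx]
    return nearest_idx, nearest_miles


# Small made-up sample; chunk_size=1 only to exercise the chunking
idx, miles = nearest_by_haversine(
    [41.49, 40.71], [-71.31, -74.01],
    [41.50, 34.05, 40.70], [-81.70, -118.24, -74.00],
    chunk_size=1,
)
```

With the question's data the call would presumably look like `nearest_by_haversine(df1["LAT"], df1["LON"], df2["LAT"], df2["LON"])`, since pandas Series convert cleanly through `np.asarray`.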
Answer 3: (score: 0)
geopy.distance.distance uses the geodesic algorithm by default, which is slower but more accurate. If you can trade accuracy for speed, you can use great_circle, which is roughly 20 times faster:
In [4]: %%timeit
...: distance.distance(newport_ri, cleveland_oh).miles
...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %%timeit
...: distance.great_circle(newport_ri, cleveland_oh).miles
...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can also use multiprocessing to parallelize the computation:
from multiprocessing import Pool
from geopy import distance
import numpy as np

def compute(points):
    # Unpack one (target, point) pair and return the great-circle distance in miles
    target, point = points
    return distance.great_circle(target, point).miles

# The __main__ guard is required on platforms that spawn worker processes (e.g. Windows)
if __name__ == "__main__":
    with Pool() as pool:
        for id1, row1 in df1.iterrows():
            target = (row1["LAT"], row1["LON"])
            distanceMiles = pool.map(
                compute,
                (
                    (target, (row2["LAT"], row2["LON"]))
                    for id2, row2 in df2.iterrows()
                )
            )
            closestPoint = np.argmin(distanceMiles)