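I have two separate datasets, df and df2, each with longitude and latitude columns. For every point in df2, I want to find the closest point in df, compute the distance between them in km, and append each value to df2 as a new column.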
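I came up with a solution, but keep in mind that df has over 700,000 rows and df2 has 60,000 rows, so my solution takes far too long to compute. The only approach I could think of is a double for loop: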
    from math import atan2, cos, radians, sin, sqrt

    def compute_shortest_dist(df, df2):
        # list to store the closest distance for each row of df2
        shortest_dist = []
        # radius of earth in km (used for calculation)
        R = 6373.0
        for i in df2.index:
            # keeps track of current minimum distance (-1 means "not set yet")
            min_dist = -1
            # latitude and longitude from df2, converted to radians
            # (assumes the columns are stored in degrees)
            lat1 = radians(df2.loc[i, 'Latitude'])
            lon1 = radians(df2.loc[i, 'Longitude'])
            for j in df.index:
                # the following is just the calculation necessary
                # to calculate the distance between each point in km
                lat2 = radians(df.loc[j, 'Latitude'])
                lon2 = radians(df.loc[j, 'Longitude'])
                dlon = lon2 - lon1
                dlat = lat2 - lat1
                a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
                c = 2 * atan2(sqrt(a), sqrt(1 - a))
                distance = R * c
                # store new shortest distance
                if min_dist == -1 or distance < min_dist:
                    min_dist = distance
            # append shortest distance to list
            shortest_dist.append(min_dist)
        return shortest_dist
This function takes far too long to run. I know there must be a more efficient way, but I'm not very good with pandas syntax.
I'd appreciate any help.
Answer 0: (score: 2)
You can write the inner loop in numpy, which should speed things up significantly:
    import numpy as np

    def compute_shortest_dist(df, df2):
        # list to store the closest distance for each row of df2
        shortest_dist = []
        # radius of earth in km (used for calculation)
        R = 6373.0
        # all points of df at once, converted to radians
        # (assumes the columns are stored in degrees)
        lat1 = np.radians(df['Latitude'].to_numpy())
        lon1 = np.radians(df['Longitude'].to_numpy())
        for i in df2.index:
            # the following is just the calculation necessary
            # to calculate the distance between each point in km
            lat2 = np.radians(df2.loc[i, 'Latitude'])
            dlat = lat1 - lat2
            dlon = lon1 - np.radians(df2.loc[i, 'Longitude'])
            a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
            distance = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
            # append shortest distance to list
            shortest_dist.append(distance.min())
        return shortest_dist
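
If the per-row Python loop is still too slow, one possible next step is to process several df2 rows at a time with numpy broadcasting. The following is only a sketch, not part of the answer above: the names haversine_km, nearest_dist_km and chunk_size are made up for illustration, and it assumes the coordinates are in degrees. Each temporary array holds roughly chunk_size × len(df) floats, so chunk_size should stay small when df has hundreds of thousands of rows.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same radius as used in the question


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Inputs are in degrees and broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # clip guards against tiny floating-point overshoot outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def nearest_dist_km(lat_ref, lon_ref, lat_query, lon_query, chunk_size=32):
    """For each query point, return the distance to its closest reference point."""
    lat_ref = np.asarray(lat_ref, dtype=float)
    lon_ref = np.asarray(lon_ref, dtype=float)
    lat_query = np.asarray(lat_query, dtype=float)
    lon_query = np.asarray(lon_query, dtype=float)
    out = np.empty(lat_query.shape[0])
    for start in range(0, lat_query.shape[0], chunk_size):
        stop = start + chunk_size
        # shape (chunk, n_ref): every query in the chunk against every reference point
        d = haversine_km(lat_query[start:stop, None], lon_query[start:stop, None],
                         lat_ref[None, :], lon_ref[None, :])
        out[start:stop] = d.min(axis=1)
    return out
```

With the frames from the question, this could be called as df2['dist_km'] = nearest_dist_km(df['Latitude'], df['Longitude'], df2['Latitude'], df2['Longitude']), where dist_km is a hypothetical column name. It still computes every pairwise distance, so for data of this size a spatial index would likely be a better fit if this remains too slow.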