I have a DataFrame containing several elements, each identified by an ID. A LAT and a LON are provided for each ID. An example is shown below:
ID LAT LON
2426 0.351649 36.921941
2451 0.351666 36.921939
2457 0.351687 36.921966
I would like a dictionary keyed by the tuple (ID1, ID2), with the distance as the value:
{(2426,2451):d1, (2426,2457):d2, (2451,2457):d3}
At the moment I compute the distance between each pair with the following code:
distances = {}
ids = to_network['ID'].values
for id_1 in ids:
    # Remove id_1 from the remaining IDs so each unordered pair is visited once
    ids = np.delete(ids, np.where(ids == id_1), axis=0)
    for id_2 in ids:
        distances[(id_1, id_2)] = compute_distance_m(
            to_network.loc[to_network['ID'] == id_1, 'LAT'].values[0],
            to_network.loc[to_network['ID'] == id_1, 'LON'].values[0],
            to_network.loc[to_network['ID'] == id_2, 'LAT'].values[0],
            to_network.loc[to_network['ID'] == id_2, 'LON'].values[0],
        )
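import geopy.distance

# Result in m
def compute_distance_m(lat1, lon1, lat2, lon2):
    coords_1 = (lat1, lon1)
    coords_2 = (lat2, lon2)
    # vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement there
    return geopy.distance.vincenty(coords_1, coords_2).km * 1000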
#returns
{(2426, 2451): 1.9917619328904765,
(2426, 2457): 5.083739036769186,
(2451, 2457): 3.7473346626876483}
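The problem is that this code is really slow, and my dataset has billions of instances, so I am looking for a better version that runs directly on the initial DataFrame.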
Answer 0 (score: 2)
Using scipy and geopy:
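import pandas as pd
from geopy.distance import vincenty
from scipy import spatial
# df is the question's DataFrame (to_network); the resulting values are in kilometers
ary = spatial.distance.cdist(df[['LAT','LON']], df[['LAT','LON']], metric=lambda u, v: vincenty(u, v).kilometers)
disdf = pd.DataFrame(ary, columns=df.ID, index=df.ID)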
disdf
Out[57]:
ID 2426 2451 2457
ID
2426 0.000000 0.001893 0.005040
2451 0.001893 0.000000 0.003798
2457 0.005040 0.003798 0.000000
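A Python callable passed as `metric` is invoked once per pair, which can become the bottleneck on large inputs. As a rough alternative, the whole matrix can be computed with NumPy broadcasting. Note that this swaps in the haversine (great-circle) formula on a spherical Earth for geopy's ellipsoidal vincenty, so the numbers differ slightly. This is only a sketch: `haversine_matrix_m` and `EARTH_RADIUS_M` are names made up for this example, and the full n x n matrix still needs O(n^2) memory, so very large datasets would have to be processed in chunks.

```python
import numpy as np

# Assumption: spherical Earth with the mean radius, not the WGS-84 ellipsoid
EARTH_RADIUS_M = 6371008.8


def haversine_matrix_m(lat_deg, lon_deg):
    """Return the n x n matrix of great-circle distances in meters."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row gives every pairwise difference
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a marginally outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Sample rows from the question
lat = [0.351649, 0.351666, 0.351687]
lon = [36.921941, 36.921939, 36.921966]
dist_m = haversine_matrix_m(lat, lon)
```

For the three sample points the results stay within a few centimeters of the vincenty values above. With a DataFrame, the inputs would be `df['LAT'].values` and `df['LON'].values`.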
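Update: to get a dictionary with one entry per pair, set the upper triangle (including the diagonal) to NaN before stacking: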
idx = np.triu_indices(len(ary))
ary[idx] = np.nan
pd.DataFrame(ary,columns=df.ID,index=df.ID).stack().to_dict()
Out[67]:
{(2451, 2426): 0.0018929013674396785,
(2457, 2426): 0.005039829336784733,
(2457, 2451): 0.0037980539470027124}
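The dictionary above is keyed (row ID, column ID), so each pair appears as (later ID, earlier ID), and the values are in kilometers. If the (ID1, ID2) order and the meters from the question are wanted, one option is to read the strict upper triangle directly instead of masking it. A minimal sketch, assuming a square distance matrix like the one printed in Out[57]; the helper name `matrix_to_pair_dict` and the hard-coded sample matrix are illustrative only.

```python
import numpy as np


def matrix_to_pair_dict(ids, dist_matrix):
    """Map (ids[i], ids[j]) with i < j to dist_matrix[i, j]."""
    ids = list(ids)
    # k=1 skips the diagonal, so only pairs with i < j are returned
    rows, cols = np.triu_indices(len(ids), k=1)
    return {(ids[i], ids[j]): float(dist_matrix[i, j])
            for i, j in zip(rows, cols)}


# Sample values copied from the distance matrix above (kilometers)
ids = [2426, 2451, 2457]
sample_km = np.array([
    [0.000000, 0.001893, 0.005040],
    [0.001893, 0.000000, 0.003798],
    [0.005040, 0.003798, 0.000000],
])

# Convert to meters before building the dictionary
pair_m = matrix_to_pair_dict(ids, sample_km * 1000.0)
```

With a DataFrame, `ids` would be `df['ID'].values` and the matrix would be the unmasked `ary`, taken before the NaN assignment.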