Pairwise distance between different elements in pandas

Time: 2018-10-22 20:49:26

Tags: python pandas optimization graph

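I have a DataFrame containing a number of distinct elements, each identified by an ID. A LAT and a LON are provided for each one. An example is shown below: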

  ID       LAT        LON
2426  0.351649  36.921941
2451  0.351666  36.921939
2457  0.351687  36.921966

I would like a dictionary keyed by the tuple (ID1, ID2), with the distance as the value:

{(2426,2451):d1, (2426,2457):d2, (2451,2457):d3}

Right now I am computing the distance between each pair with the following code:

import geopy.distance
import numpy as np

# Result in m
# Note: vincenty was removed in geopy 2.0; geopy.distance.geodesic replaces it there.
def compute_distance_m(lat1, lon1, lat2, lon2):
    coords_1 = (lat1, lon1)
    coords_2 = (lat2, lon2)
    return geopy.distance.vincenty(coords_1, coords_2).km * 1000

distances = {}
ids = to_network['ID'].values
for id_1 in ids:
    # Drop id_1 from the remaining IDs so that each unordered pair is computed once
    ids = np.delete(ids, np.where(ids == id_1), axis=0)
    for id_2 in ids:
        distances[(id_1, id_2)] = compute_distance_m(
            to_network.loc[(to_network['ID'] == id_1), 'LAT'].values[0],
            to_network.loc[(to_network['ID'] == id_1), 'LON'].values[0],
            to_network.loc[(to_network['ID'] == id_2), 'LAT'].values[0],
            to_network.loc[(to_network['ID'] == id_2), 'LON'].values[0])

# Returns
{(2426, 2451): 1.9917619328904765,
 (2426, 2457): 5.083739036769186,
 (2451, 2457): 3.7473346626876483}

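The problem is that this code is really slow, and my dataset has billions of instances, so I am looking for a better version that can run directly on the initial DataFrame.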

1 answer:

Answer 0: (score: 2)

Using scipy and geopy:

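import pandas as pd
from geopy.distance import vincenty  # removed in geopy 2.0; geodesic replaces it there
from scipy import spatial

# cdist calls the lambda once per pair of rows; distances here are in kilometers
ary = spatial.distance.cdist(df[['LAT', 'LON']], df[['LAT', 'LON']],
                             metric=lambda u, v: vincenty(u, v).kilometers)
disdf = pd.DataFrame(ary, columns=df.ID, index=df.ID)
disdf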
Out[57]: 
ID        2426      2451      2457
ID                                
2426  0.000000  0.001893  0.005040
2451  0.001893  0.000000  0.003798
2457  0.005040  0.003798  0.000000
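
Because `cdist` calls the Python lambda once for every pair, the approach above may still be slow on large inputs. Below is a minimal sketch of a fully vectorized alternative. It assumes that a spherical-Earth haversine approximation is acceptable in place of the Vincenty formula used above, so the results differ slightly from geopy's. The helper name `haversine_matrix_m` and the radius constant are illustrative choices, not part of the original answer.

```python
import numpy as np

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (assumed spherical model)

def haversine_matrix_m(lat_deg, lon_deg):
    """Return an (n, n) array of great-circle distances in meters."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing `a` marginally outside [0, 1].
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Sample coordinates from the question
lat = [0.351649, 0.351666, 0.351687]
lon = [36.921941, 36.921939, 36.921966]
dist_m = haversine_matrix_m(lat, lon)
```

Note that the full matrix needs memory proportional to n squared, so for very large datasets it would have to be computed in row blocks rather than all at once.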

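Update

import numpy as np

# Blank out the upper triangle and the diagonal so each pair is kept only once
idx = np.triu_indices(len(ary))
ary[idx] = np.nan
# stack() drops the NaN entries, leaving a Series indexed by (row ID, column ID)
pd.DataFrame(ary, columns=df.ID, index=df.ID).stack().to_dict()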
Out[67]: 
{(2451, 2426): 0.0018929013674396785,
 (2457, 2426): 0.005039829336784733,
 (2457, 2451): 0.0037980539470027124}
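
The dictionary above is keyed on the lower triangle, so each key reads (later ID, earlier ID), and the values are in kilometers. If the key order and units from the question are wanted instead, one possible sketch is to walk the strict upper triangle of the matrix. The arrays `ids` and `ary_km` below are stand-ins built from the rounded values printed in the answer, not the output of a fresh computation.

```python
import numpy as np

# Stand-ins for df.ID and the distance matrix, copied from the printed output
ids = np.array([2426, 2451, 2457])
ary_km = np.array([
    [0.000000, 0.001893, 0.005040],
    [0.001893, 0.000000, 0.003798],
    [0.005040, 0.003798, 0.000000],
])

# k=1 skips the diagonal, leaving only pairs with row index < column index.
rows, cols = np.triu_indices(len(ids), k=1)
distances_m = {
    (int(ids[i]), int(ids[j])): float(ary_km[i, j]) * 1000.0  # km -> m
    for i, j in zip(rows, cols)
}
```

This leaves the original matrix unmodified, which may be convenient if it is reused afterwards.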