I have a DataFrame with x and y coordinates and an ID column, like this:
df = pd.DataFrame(np.random.randint(0,100,size=(26, 2)), columns=list('XY'))
df['id'] = list('abcdefghijklmnopqrstuvwxyz')
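How can I find the straight-line distance between each region and every other region in a Pythonic way, keeping the origin and destination (O-D) region IDs, without a pair of nested loops?
The output should be the same as what the following produces: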
import math

def get_distance(start, end):
    dist = math.hypot(end[0]-start[0], end[1]-start[1])
    return dist

data = []
for index, row in df.iterrows():
    start = [row['X'], row['Y']]
    start_region = row['id']
    for other_index, other_row in df.iterrows():
        end = [other_row['X'], other_row['Y']]
        end_rengion = other_row['id']
        distance = get_distance(start, end)
        entry = dict(
            start_region = start_region,
            end_rengion = end_rengion,
            distance = distance
        )
        data.append(entry)
pd.DataFrame(data)
Answer 0 (score: 1)
You can use scipy.spatial.distance.cdist for this. It is a SciPy function implemented in C, so it is much faster than nested Python loops.
import pandas as pd
import numpy as np
from scipy.spatial import distance
import itertools
df = pd.DataFrame(np.random.randint(0,100,size=(26, 2)), columns=list('XY'))
ids = list('abcdefghijklmnopqrstuvwxyz')
df['id'] = ids
# get the points as an (n, 2) array
points = df[["X", "Y"]].values
# calculate the distance from each point to every other point: row i holds the distances for point i, so distances[i, j] is the distance from point i to point j
distances = distance.cdist(points, points, "euclidean")
distances = distances.flatten()
# get the (start, end) id pairs, in the same row-major order as the flattened distances
cartesian = list(itertools.product(ids, ids))
data = dict(
    start_region = [x[0] for x in cartesian],
    end_rengion = [x[1] for x in cartesian],
    distance = distances
)
print(pd.DataFrame(data))
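The approach relies on flatten() and itertools.product(ids, ids) walking the pairs in the same order. Here is a small sanity check of that, as a sketch rather than part of the original answer. The fixed seed and the names matrix, flat, pairs, expected and matches are my own additions for illustration.

```python
import itertools
import math

import numpy as np
import pandas as pd
from scipy.spatial import distance

# Fixed seed so the check is reproducible (an assumption; the question uses unseeded data)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(26, 2)), columns=list("XY"))
ids = list("abcdefghijklmnopqrstuvwxyz")
df["id"] = ids

points = df[["X", "Y"]].to_numpy()
matrix = distance.cdist(points, points, "euclidean")

# flatten() reads the matrix row by row: (a, a), (a, b), ..., (z, z),
# which is the order in which itertools.product(ids, ids) yields its pairs
flat = matrix.flatten()
pairs = list(itertools.product(ids, ids))

# Reference distances computed pair by pair, as in the question's nested loops
expected = [
    math.hypot(float(x2 - x1), float(y2 - y1))
    for (x1, y1), (x2, y2) in itertools.product(points, points)
]

matches = np.allclose(flat, expected)
print(matches)
```

If the rows of df were reordered after ids was built, the labels would no longer line up, so it is safer to take the ids from df['id'] at the same time as the points.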
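Edit based on the OP's comment: you can pass a custom function to cdist, so it should not be hard to adapt this to replace the Euclidean distance with point-to-point distances from the Google API.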
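As a rough sketch of the custom-function route: cdist accepts a callable that takes two 1-D arrays and returns a number. I am not calling any Google API here. The haversine_km function below is a hypothetical stand-in that computes great-circle distance locally, and the coordinates are made up for illustration.

```python
import math

import numpy as np
from scipy.spatial import distance


def haversine_km(u, v):
    """Great-circle distance in km between two (lat, lon) points given in degrees.

    Hypothetical stand-in for a distance looked up from an external service.
    """
    lat1, lon1, lat2, lon2 = map(math.radians, (u[0], u[1], v[0], v[1]))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


# Made-up (lat, lon) points: two on the equator and the north pole
coords = np.array([[0.0, 0.0], [0.0, 90.0], [90.0, 0.0]])

# cdist calls haversine_km once for every (row of coords, row of coords) pair
custom = distance.cdist(coords, coords, haversine_km)
print(custom)
```

Note that a Python callable is invoked once per pair, so this loses most of the speed advantage of the built-in metrics. If each call were a network request, caching or batching the lookups would probably matter more than the choice of cdist.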