Pythonic (and most performant) way to generate the distances between all points while keeping meta_data?

Asked: 2018-10-27 10:17:14

Tags: python pandas

I have a DataFrame with x and y coordinates and an ID column, like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(26, 2)), columns=list('XY'))
df['id'] = list('abcdefghijklmnopqrstuvwxyz')

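While keeping the O-D region IDs, how can I find the straight-line distance between each region and every other region in a Pythonic way, without a pair of nested loops?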

The output should be the same as what the following produces:

import math
def get_distance(start, end):
    dist = math.hypot(end[0]-start[0], end[1]-start[1])
    return dist

data = []

for index, row in df.iterrows():
    start = [row['X'], row['Y']]
    start_region = row['id']

    for other_index, other_row in df.iterrows():
        end = [other_row['X'], other_row['Y']]
        end_rengion = other_row['id']
        distance = get_distance(start, end)

        entry = dict(
            start_region = start_region,
            end_rengion = end_rengion,
            distance = distance
        )

        data.append(entry)

pd.DataFrame(data)

1 answer:

Answer 0 (score: 1)

You can use scipy.spatial.distance.cdist for this. It is a SciPy function implemented in C, so it is much faster than nested Python loops.

import pandas as pd
import numpy as np
from scipy.spatial import distance
import itertools

df = pd.DataFrame(np.random.randint(0,100,size=(26, 2)), columns=list('XY'))
ids = list('abcdefghijklmnopqrstuvwxyz')
df['id'] = ids

# coordinates as an (n, 2) array
points = df[["X", "Y"]].values

# distance from every point to every other point:
# distances[i, j] is the distance from point i to point j
distances = distance.cdist(points, points, "euclidean")
# flatten row by row so the order matches itertools.product(ids, ids) below
distances = distances.flatten()

# (start, end) id pairs, in the same order as the flattened distances
cartesian = list(itertools.product(ids, ids))

data = dict(
    start_region=[x[0] for x in cartesian],
    end_rengion=[x[1] for x in cartesian],
    distance=distances,
)
print(pd.DataFrame(data))
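
As a possible variation on the code above, the id pairs can also be built with NumPy instead of itertools, which keeps everything vectorized. This is only a sketch: the helper name pairwise_distance_table and its parameters are made up for illustration, and the column name end_rengion is spelled as in the question so the output lines up with it.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance


def pairwise_distance_table(df, id_col="id", coord_cols=("X", "Y")):
    """Return one row per (start, end) pair with its Euclidean distance.

    Hypothetical helper, not part of pandas or SciPy.
    """
    points = df[list(coord_cols)].to_numpy()
    ids = df[id_col].to_numpy()
    n = len(ids)

    # dist[i, j] is the distance from point i to point j
    dist = distance.cdist(points, points, "euclidean")

    return pd.DataFrame({
        # each start id repeated n times: a a a b b b ...
        "start_region": np.repeat(ids, n),
        # the whole id list repeated n times: a b c a b c ...
        "end_rengion": np.tile(ids, n),
        # row-major flattening matches the repeat/tile order above
        "distance": dist.ravel(),
    })


# small hand-checkable example
demo = pd.DataFrame({"X": [0, 3, 0], "Y": [0, 4, 1], "id": list("abc")})
table = pairwise_distance_table(demo)
```

np.repeat and np.tile produce the ids in the same row-major order as dist.ravel(), so each distance stays next to the pair it belongs to.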

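Edit, following the OP's comment: you can pass a custom function to cdist, so it should not be hard to replace the Euclidean distance with point-to-point distances from the Google API.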
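
To illustrate the custom-function route, here is a minimal sketch that passes a Python callable as the metric argument of cdist. The function manhattan is only a stand-in chosen for this example; it is not the Google API lookup, which would need its own client code and is not shown here.

```python
import numpy as np
from scipy.spatial import distance


def manhattan(u, v):
    """Stand-in metric: cdist calls this with two 1-D coordinate arrays."""
    return float(np.abs(u - v).sum())


points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])

# custom[i, j] is manhattan(points[i], points[j])
custom = distance.cdist(points, points, metric=manhattan)
```

Note that a Python callable is invoked once per pair of points, so this path is much slower than the built-in "euclidean" metric. If the real metric involves a network request per pair, that cost will likely dominate.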