I have the following data:
import pandas as pd
from io import StringIO
data = """
Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))
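Each row contains data about one location. For each location, I need to find the distance to every other location (row), computed as follows (simplified for brevity):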
distance = sqrt((Long1-Long2)^2 + (Lat1-Lat2)^2)
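As a minimal sketch of the formula above, using only the standard library: the helper name `planar_distance` is hypothetical, and it treats longitude and latitude as plain planar coordinates, exactly as the simplified formula does.

```python
import math

def planar_distance(long1, lat1, long2, lat2):
    """Straight-line distance between two (Long, Lat) pairs, in degrees."""
    # math.hypot(dx, dy) computes sqrt(dx**2 + dy**2)
    return math.hypot(long1 - long2, lat1 - lat2)

# ABC11 vs ABC20 from the sample data
d = planar_distance(139.6295542, 35.61144069, 139.630596, 35.61045559)
print(d)
```

The result is in degrees, so it is only useful for ranking nearby points, not as a physical distance.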
If I were doing this outside pandas, I would do the following:
import math
rows = df.to_dict('records')
# distance of each location w.r.t. other locations, excluding self
results = {}
for row in rows:
    loc = row['Location']
    results[loc] = {}
    # get a new list excluding the current row
    nrows = [r for r in rows if r['Location'] != loc]
    for nrow in nrows:
        dist = math.sqrt((row["Long"] - nrow["Long"])**2 + (row["Lat"] - nrow["Lat"])**2)
        results[loc][nrow["Location"]] = dist
# find the location with min distance
fin_results = {}
for k, v in results.items():
    fin_results[k] = {}
    minValKey = min(v, key=v.get)
    fin_results[k]["location"] = minValKey
    fin_results[k]["dist"] = v[minValKey]
This gives output like the following, where each location is mapped to its nearest location and the distance to it.
{'ABC11': {'location': 'ABC20', 'dist': 0.001433795400325211}, 'ABC20': {'location': 'ABC11', 'dist': 0.001433795400325211}, 'ABC03': {'location': 'ABC11', 'dist': 0.001897909941062068}, 'ABC54': {'location': 'ABC06', 'dist': 0.006614555169662396}, 'ABC05': {'location': 'ABC06', 'dist': 0.002631545857463665}, 'ABC06': {'location': 'ABC05', 'dist': 0.002631545857463665}, 'ABC24': {'location': 'ABC06', 'dist': 0.004054030973106164}}
While this works, I would like to know the pandas way of doing it.
Desired output:
+----------+-------------------+----------------------------+
| location | nearest_location | nearest_location_distance |
+----------+-------------------+----------------------------+
| 'ABC11' | 'ABC20' | 0.001433795400325211 |
| 'ABC20' | 'ABC11' | 0.001433795400325211 |
| 'ABC03' | 'ABC11' | 0.001897909941062068 |
| 'ABC54' | 'ABC06' | 0.006614555169662396 |
| 'ABC05' | 'ABC06' | 0.002631545857463665 |
| 'ABC06' | 'ABC05' | 0.002631545857463665 |
| 'ABC24' | 'ABC06' | 0.004054030973106164 |
+----------+-------------------+----------------------------+
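One possible bridge from the loop-based result to the table above is `pd.DataFrame.from_dict` with `orient='index'`. A minimal sketch, assuming `fin_results` has the shape printed earlier (only two entries are reproduced here to keep it short); the variable name `out` is made up for illustration.

```python
import pandas as pd

# two entries copied from the printed result; the real dict has seven
fin_results = {
    'ABC11': {'location': 'ABC20', 'dist': 0.001433795400325211},
    'ABC20': {'location': 'ABC11', 'dist': 0.001433795400325211},
}

out = (
    pd.DataFrame.from_dict(fin_results, orient='index')   # one row per outer key
      .rename(columns={'location': 'nearest_location',
                       'dist': 'nearest_location_distance'})
      .rename_axis('location')                            # name the index...
      .reset_index()                                      # ...and turn it into a column
)
print(out)
```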
Answer 0 (score: 1)
You can use numpy broadcasting:
import numpy as np
long_ = df.Long.to_numpy()
lat = df.Lat.to_numpy()
distances = np.sqrt((long_ - long_[:, None]) ** 2 + (lat - lat[:,None]) **2)
dist_df = pd.DataFrame(distances, index=df.Location, columns=df.Location)
Location ABC11 ABC20 ABC03 ABC54 ABC05 ABC06 ABC24
ABC11 0.000000 0.001434 0.001898 0.167940 0.167348 0.165044 0.163559
ABC20 0.001434 0.000000 0.002878 0.167472 0.166822 0.164528 0.163012
ABC03 0.001898 0.002878 0.000000 0.166680 0.166151 0.163836 0.162385
ABC54 0.167940 0.167472 0.166680 0.000000 0.007295 0.006615 0.010666
ABC05 0.167348 0.166822 0.166151 0.007295 0.000000 0.002632 0.004558
ABC06 0.165044 0.164528 0.163836 0.006615 0.002632 0.000000 0.004054
ABC24 0.163559 0.163012 0.162385 0.010666 0.004558 0.004054 0.000000
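To see why broadcasting yields a full pairwise matrix: `x[:, None]` reshapes a length-n array to shape (n, 1), and subtracting it from the shape-(n,) array broadcasts both to (n, n). A toy sketch with made-up values, not the data from the question:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])    # toy coordinates

col = x[:, None]                 # shape (3, 1)
diff = x - col                   # broadcasts to (3, 3); diff[i, j] == x[j] - x[i]

print(col.shape, diff.shape)
print(diff)
```

Squaring removes the sign, which is why the resulting distance matrix is symmetric with zeros on the diagonal.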
m = dist_df[dist_df>0]
pd.concat([m.idxmin(axis=1).rename('nearest_location'),
           m.min(axis=1).rename('nearest_location_distance')], axis=1)
The output DataFrame looks like this:
nearest_location nearest_location_distance
Location
ABC11 ABC20 0.001434
ABC20 ABC11 0.001434
ABC03 ABC11 0.001898
ABC54 ABC06 0.006615
ABC05 ABC06 0.002632
ABC06 ABC05 0.002632
ABC24 ABC06 0.004054
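This computes the distance from each row to every other row. That is how I interpreted the question; I am not sure whether this is what you are after.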
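One thing to be aware of when masking on `> 0`: two different locations that share identical coordinates would be masked out as well. If that matters for your data, a possible alternative is to exclude only the diagonal (each location's distance to itself) by setting it to infinity. A minimal sketch on three of the sample rows; names such as `nearest` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Location': ['ABC11', 'ABC20', 'ABC03'],
    'Long': [139.6295542, 139.630596, 139.6300307],
    'Lat': [35.61144069, 35.61045559, 35.61327781],
})

xy = df[['Long', 'Lat']].to_numpy()
# pairwise differences have shape (n, n, 2); take the Euclidean norm over the last axis
dist = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)       # ignore self-distances only

idx = dist.argmin(axis=1)            # position of the nearest other row
nearest = pd.DataFrame({
    'location': df['Location'],
    'nearest_location': df['Location'].to_numpy()[idx],
    'nearest_location_distance': dist[np.arange(len(df)), idx],
})
print(nearest)
```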
Answer 1 (score: 1)
You can use scipy's distance_matrix, which is essentially what @rafaelc coded:
from scipy.spatial import distance_matrix
dist_mat = distance_matrix(df[['Long','Lat']], df[['Long','Lat']])
# label the distance matrix with the location names
dist_mat = pd.DataFrame(dist_mat,
                        index=df.Location,
                        columns=df.Location)
# mask the zero self-distances, then convert the result to a dict
(dist_mat.where(dist_mat>0)
    .agg(['idxmin', 'min'])
    .to_dict()
)
Output:
{'ABC11': {'idxmin': 'ABC20', 'min': 0.001433795400325211},
 'ABC20': {'idxmin': 'ABC11', 'min': 0.001433795400325211},
 'ABC03': {'idxmin': 'ABC11', 'min': 0.001897909941062068},
 'ABC54': {'idxmin': 'ABC06', 'min': 0.006614555169662396},
 'ABC05': {'idxmin': 'ABC06', 'min': 0.002631545857463665},
 'ABC06': {'idxmin': 'ABC05', 'min': 0.002631545857463665},
 'ABC24': {'idxmin': 'ABC06', 'min': 0.004054030973106164}}
If you only need the DataFrame:
(dist_mat.where(dist_mat>0)
    .agg(['idxmin', 'min'])
    .T
)
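If the full pairwise matrix becomes too large to hold in memory, a k-d tree query is one possible alternative (an assumption about your use case, not part of the answer above). A minimal sketch with `scipy.spatial.cKDTree` on three of the sample rows; `k=2` is used because each point's closest hit is itself at distance 0, which assumes no two locations share the same coordinates. The name `res` is illustrative.

```python
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({
    'Location': ['ABC11', 'ABC20', 'ABC03'],
    'Long': [139.6295542, 139.630596, 139.6300307],
    'Lat': [35.61144069, 35.61045559, 35.61327781],
})

xy = df[['Long', 'Lat']].to_numpy()
tree = cKDTree(xy)
# k=2: column 0 is the point itself, column 1 is its nearest other point
dists, idxs = tree.query(xy, k=2)

res = pd.DataFrame({
    'location': df['Location'],
    'nearest_location': df['Location'].to_numpy()[idxs[:, 1]],
    'nearest_location_distance': dists[:, 1],
})
print(res)
```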
Answer 2 (score: 1)
You can also use df.iterrows:
distance_min = []
location_min = []
output_df = df.copy()
for i, row in df.iterrows():
    # distance from this row to every row in df
    dist = ((row['Long'] - df['Long']).pow(2) + (row['Lat'] - df['Lat']).pow(2)).pow(1/2)
    # drop the zero self-distance, then take the nearest remaining row
    location_min.append(df.at[dist[dist > 0].idxmin(), 'Location'])
    distance_min.append(dist[dist > 0].min())
output_df['nearest_location'] = location_min
output_df['nearest_location_distance'] = distance_min
output_df = output_df.reindex(columns=['Location', 'nearest_location', 'nearest_location_distance'])
print(output_df)
Location nearest_location nearest_location_distance
0 ABC11 ABC20 0.001434
1 ABC20 ABC11 0.001434
2 ABC03 ABC11 0.001898
3 ABC54 ABC06 0.006615
4 ABC05 ABC06 0.002632
5 ABC06 ABC05 0.002632
6 ABC24 ABC06 0.004054
Answer 3 (score: 0)
Since ansev already posted the same solution, here is a more compact one:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(data))
df['result']= (df['Lat'].diff(-1).pow(2)+df['Long'].diff(-1).pow(2)).pow(1/2)
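It is worth checking what `diff(-1)` computes: each value minus the value one row below, with NaN in the last row. A toy sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])
d = s.diff(-1)     # d[i] == s[i] - s[i + 1]; the last entry has no successor
print(d.tolist())
```

So the `result` column holds the distance from each row to the next row in the frame, rather than the result of a search over all other rows.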