Pandas - Calculate a new column based on relative values in other rows

Time: 2019-10-08 17:26:42

Tags: python pandas

I have the following data:

import pandas as pd
from io import StringIO

data = """
Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

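Each row contains data about one location. For each location, I need to find the distance to every other location (row), computed as follows (simplified for brevity):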

distance = sqrt((Long1-Long2)^2 + (Lat1-Lat2)^2)
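As a quick sanity check of the formula, here is a minimal sketch that computes the distance between the first two rows of the sample data (ABC11 and ABC20). The helper name `planar_distance` is made up for illustration, and it treats longitude and latitude as flat Cartesian coordinates, as the question does for simplicity.

```python
import math


def planar_distance(long1, lat1, long2, lat2):
    """Straight-line distance, treating (Long, Lat) as planar coordinates."""
    return math.hypot(long1 - long2, lat1 - lat2)


# ABC11 and ABC20 from the sample data
d = planar_distance(139.6295542, 35.61144069, 139.630596, 35.61045559)
print(d)  # roughly 0.0014338
```

The value agrees with the distance listed for the ABC11/ABC20 pair in the expected output further down.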

If I were doing this outside of pandas, I would do it as follows:

import math

rows = df.to_dict('records')

# distance of each location w.r.t other locations excluding self
results = {}
for row in rows:
    loc = row['Location']
    results[loc] = {}
    # get a new list excl the curr row
    nrows = [row for row in rows if row['Location'] != loc]
    for nrow in nrows:
        dist = math.sqrt((row["Long"] - nrow["Long"])**2 + (row["Lat"] - nrow["Lat"])**2)
        results[loc][nrow["Location"]] = dist

# find the location with min distance 
fin_results = {}
for k, v in results.items():
    fin_results[k] = {}
    minValKey = min(v, key = v.get)
    fin_results[k]["location"] = minValKey 
    fin_results[k]["dist"] = v[minValKey]

This gives the output shown below: for each location, the nearest other location and the distance to it.

{'ABC11': {'location': 'ABC20', 'dist': 0.001433795400325211}, 'ABC20': {'location': 'ABC11', 'dist': 0.001433795400325211}, 'ABC03': {'location': 'ABC11', 'dist': 0.001897909941062068}, 'ABC54': {'location': 'ABC06', 'dist': 0.006614555169662396}, 'ABC05': {'location': 'ABC06', 'dist': 0.002631545857463665}, 'ABC06': {'location': 'ABC05', 'dist': 0.002631545857463665}, 'ABC24': {'location': 'ABC06', 'dist': 0.004054030973106164}}

While this works functionally, I would like to know the pandas way of doing it.

Desired output:

+----------+-------------------+----------------------------+
| location |  nearest_location |  nearest_location_distance |
+----------+-------------------+----------------------------+
| 'ABC11'  | 'ABC20'           | 0.001433795400325211       |
| 'ABC20'  | 'ABC11'           | 0.001433795400325211       |
| 'ABC03'  | 'ABC11'           | 0.001897909941062068       |
| 'ABC54'  | 'ABC06'           | 0.006614555169662396       |
| 'ABC05'  | 'ABC06'           | 0.002631545857463665       |
| 'ABC06'  | 'ABC05'           | 0.002631545857463665       |
| 'ABC24'  | 'ABC06'           | 0.004054030973106164       |
+----------+-------------------+----------------------------+

4 answers:

Answer 0: (score: 1)

You can use numpy broadcasting:

import numpy as np

long_ = df.Long.to_numpy()
lat   = df.Lat.to_numpy() 

distances = np.sqrt((long_ - long_[:, None]) ** 2 + (lat - lat[:,None]) **2)

dist_df = pd.DataFrame(distances, index=df.Location, columns=df.Location)

Location     ABC11     ABC20     ABC03     ABC54     ABC05     ABC06     ABC24

ABC11     0.000000  0.001434  0.001898  0.167940  0.167348  0.165044  0.163559
ABC20     0.001434  0.000000  0.002878  0.167472  0.166822  0.164528  0.163012
ABC03     0.001898  0.002878  0.000000  0.166680  0.166151  0.163836  0.162385
ABC54     0.167940  0.167472  0.166680  0.000000  0.007295  0.006615  0.010666
ABC05     0.167348  0.166822  0.166151  0.007295  0.000000  0.002632  0.004558
ABC06     0.165044  0.164528  0.163836  0.006615  0.002632  0.000000  0.004054
ABC24     0.163559  0.163012  0.162385  0.010666  0.004558  0.004054  0.000000

# mask the zero self-distances so they are not picked as the minimum
m = dist_df[dist_df > 0]
pd.concat([m.idxmin(axis=1).rename('nearest_location'),
           m.min(axis=1).rename('nearest_location_distance')], axis=1)

The output DataFrame will look like:

        nearest_location  nearest_location_distance
Location                                            
ABC11               ABC20                   0.001434
ABC20               ABC11                   0.001434
ABC03               ABC11                   0.001898
ABC54               ABC06                   0.006615
ABC05               ABC06                   0.002632
ABC06               ABC05                   0.002632
ABC24               ABC06                   0.004054

This finds the distance from each row to all other rows. That is how I interpreted the question; I'm not sure exactly what your goal is.
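One caveat with masking via `dist_df > 0`: two distinct locations that happen to share the same coordinates would be masked out as well. A minimal sketch of an alternative, assuming you only want to exclude each row's distance to itself, is to overwrite the diagonal with infinity before taking the minimum. The names `coords`, `nearest_idx`, and `result` are illustrative and not part of the answer above.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

coords = df[["Long", "Lat"]].to_numpy()

# Pairwise coordinate differences via broadcasting: shape (n, n, 2)
diff = coords[:, None, :] - coords[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Exclude only the self-distances (the diagonal), not every zero entry
np.fill_diagonal(distances, np.inf)

nearest_idx = distances.argmin(axis=1)
result = pd.DataFrame({
    "location": df["Location"],
    "nearest_location": df["Location"].to_numpy()[nearest_idx],
    "nearest_location_distance": distances.min(axis=1),
})
print(result)
```

For the sample data this should yield the same nearest locations as the masked version, since no two rows share coordinates.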

Answer 1: (score: 1)

You can use scipy's distance_matrix, which is essentially what @rafaelc coded:

from scipy.spatial import distance_matrix

dist_mat = distance_matrix(df[['Long','Lat']],df[['Long','Lat']])

# assign distance matrix with appropriate name
dist_mat = pd.DataFrame(dist_mat, 
                        index=df.Location, 
                        columns=df.Location)

# convert the data frame to dict
(dist_mat.where(dist_mat>0)
     .agg(['idxmin', 'min'])
     .to_dict()
)

Output:

{'ABC11': {'idxmin': 'ABC20', 'min': 0.001433795400325211},
 'ABC20': {'idxmin': 'ABC11', 'min': 0.001433795400325211},
 'ABC03': {'idxmin': 'ABC11', 'min': 0.001897909941062068},
 'ABC54': {'idxmin': 'ABC06', 'min': 0.006614555169662396},
 'ABC05': {'idxmin': 'ABC06', 'min': 0.002631545857463665},
 'ABC06': {'idxmin': 'ABC05', 'min': 0.002631545857463665},
 'ABC24': {'idxmin': 'ABC06', 'min': 0.004054030973106164}}

If you only need the DataFrame:

(dist_mat.where(dist_mat>0)
     .agg(['idxmin', 'min'])
     .T
)
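For larger datasets the full n × n distance matrix can become expensive in memory. As a sketch of an alternative (not part of the answer above), scipy's `cKDTree` can return the two closest points for each row: the first hit is the point itself at distance 0, so the second is the nearest other location. This assumes no two locations share identical coordinates; the names `tree` and `nearest` are illustrative.

```python
from io import StringIO

import pandas as pd
from scipy.spatial import cKDTree

data = """Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

coords = df[["Long", "Lat"]].to_numpy()
tree = cKDTree(coords)

# k=2: column 0 is each point itself, column 1 is its nearest other point
dist, idx = tree.query(coords, k=2)

nearest = pd.DataFrame({
    "location": df["Location"],
    "nearest_location": df["Location"].to_numpy()[idx[:, 1]],
    "nearest_location_distance": dist[:, 1],
})
print(nearest)
```

With only seven rows the tree gives no practical benefit; it is shown here only because the query avoids materialising every pairwise distance.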

Answer 2: (score: 1)

You can also use df.iterrows:

distance_min=[]
location_min=[]
output_df=df.copy()
for i, col in df.iterrows():
    dist=((col['Long']-df['Long']).pow(2)+(col['Lat']-df['Lat']).pow(2)).pow(1/2)
    location_min.append(df.at[dist[dist>0].idxmin(),'Location'])
    distance_min.append(dist[dist>0].min())

output_df['nearest_location']=location_min
output_df['nearest_location_distance']=distance_min
output_df=output_df.reindex(columns=['Location','nearest_location','nearest_location_distance'])
print(output_df)

 Location  nearest_location  nearest_location_distance
0    ABC11            ABC20                   0.001434
1    ABC20            ABC11                   0.001434
2    ABC03            ABC11                   0.001898
3    ABC54            ABC06                   0.006615
4    ABC05            ABC06                   0.002632
5    ABC06            ABC05                   0.002632
6    ABC24            ABC06                   0.004054

Answer 3: (score: 0)

The same solution that ansev proposed, but more refined:

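import pandas as pd
from io import StringIO

# `data` is the CSV string defined in the question
df = pd.read_csv(StringIO(data))
# note: diff(-1) compares each row only with the row below it,
# so `result` is the distance to the next row, not to the nearest location
df['result']= (df['Lat'].diff(-1).pow(2)+df['Long'].diff(-1).pow(2)).pow(1/2)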
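To see what `diff(-1)` computes, here is a small sketch with made-up numbers: each element has the following element subtracted from it, and the last element has no successor, so it becomes NaN. Applied to the coordinates above, this gives the distance from each row to the row directly below it, which matches the nearest-location result only when the nearest location happens to be the next row.

```python
import pandas as pd

# Illustrative values, not taken from the question's data
s = pd.Series([1.0, 4.0, 9.0])

# periods=-1 subtracts the *next* element: s[i] - s[i + 1]
step = s.diff(-1)
print(step.tolist())  # [-3.0, -5.0, nan]
```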