How to extract DataFrame segments that contain at least two records

Time: 2018-07-04 12:49:06

Tags: python-3.x pandas

Given a DataFrame df:

Trip_id   Latitude   Longitude  Acceleration    date_time    Transportation_Mode  
   1    39.98528333 116.3073667 186.6302183   5/26/2007 10:21       Walk   
   1    39.98521667 116.30955   20.69027793   5/26/2007 10:22       Walk   
   1    39.98513333 116.3097667 12.41329907   5/26/2007 10:22       Walk   
   1    39.9845     116.31      35.69170853   5/26/2007 10:25       Bike  
   1    39.98423333 116.3102333 28.01721471   5/26/2007 10:25       Bike  
   1    39.98403333 116.3104333 2921.070572   5/26/2007 10:25       Bike  
   1    39.98518333 116.3446    197.9064152   5/26/2007 10:29       Bike  
   1    39.96858333 116.3471167 409.3939156   5/26/2007 10:31       Walk   
   1    39.9649     116.3473333 174.0008214   5/26/2007 10:31       Walk   
   1    39.96335    116.3470333 500.6336985   5/26/2007 10:32       Walk   
   1    39.95885    116.3474    298.458933    5/26/2007 10:32       Car  
   1    39.95635    116.3486833 1445.861393   5/26/2007 10:32       Car  
   1    39.94336667 116.3499833 116.5939123   5/26/2007 10:34       Car  
   2    39.94231667 116.3499667 133.0986026   5/26/2007 10:34       Walk   
   2    39.94123333 116.3493    1503.18099    5/26/2007 10:34       Walk   
   2    39.9277     116.3497667 12.37086539   5/26/2007 10:36       Car  
   2    39.91055    116.35045   7.897042746   5/26/2007 10:38       Car 

I want to obtain the final DataFrame df1:

Trip_id Segid   Transportation_Mode  Start_date_time     End_date_time   Mean_Acceleration  Top_Acceleration1   Top_Acceleration2 
   1       1           Walk          5/26/2007 10:21    5/26/2007 10:22  73.24459843          186.6302183        20.69027793  
   1       2           Bike          5/26/2007 10:25    5/26/2007 10:29  795.6714775          2921.070572        197.9064152  
   1       3           Walk          5/26/2007 10:31    5/26/2007 10:32  361.3428118          500.6336985        409.3939156  
   1       4           Car           5/26/2007 10:32    5/26/2007 10:34  620.3047461          1445.861393        298.458933  
   2       1           Walk          5/26/2007 10:34    5/26/2007 10:34  818.1397964          1503.18099         133.0986026  
   2       2           Car           5/26/2007 10:36    5/26/2007 10:38  10.13395407          12.37086539        7.897042746    

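i) Group the DataFrame so that consecutive rows with the same Transportation_Mode form one group/segment.
ii) df1 is a list of segments; each segment has its start_date_time and end_date_time, its mean acceleration, and its top 2 accelerations.
iii) A trip consists of several segments, and each segment has a single transportation mode.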
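
The grouping described in (i) can be sketched on a toy Series. This is only an illustration under the assumption that rows are already in time order; the names modes and run_id are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Toy column of transportation modes (illustrative values only)
modes = pd.Series(['Walk', 'Walk', 'Bike', 'Bike', 'Walk', 'Car'])

# A row starts a new run when its value differs from the previous row;
# the cumulative sum of those "change" flags gives one id per consecutive run
run_id = modes.ne(modes.shift()).cumsum()

print(run_id.tolist())
```

Note that the second Walk run gets its own id, which is why grouping by Transportation_Mode alone would not be enough.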

1 answer:

Answer 0: (score: 1)

I think you need DataFrameGroupBy.agg with custom functions. Each group must contain at least 2 values so that the top 2 Acceleration values can be taken:

import pandas as pd

# convert the column to datetimes
df['date_time'] = pd.to_datetime(df['date_time'])

# helper Series: an id that increases each time Transportation_Mode changes
s = df['Transportation_Mode'].ne(df['Transportation_Mode'].shift()).cumsum().rename('g')

# keep only rows whose segment id occurs at least twice (drops single-row segments)
df = df[s.duplicated(keep=False)]

# largest and second-largest value per group (sort descending, take by position)
f1 = lambda x: x.sort_values(ascending=False).iloc[0]
f1.__name__ = 'Top_1'
f2 = lambda x: x.sort_values(ascending=False).iloc[1]
f2.__name__ = 'Top_2'

d = {'date_time':['first','last'], 'Acceleration':['mean', f1, f2]}

df1 = df.groupby(['Trip_id','Transportation_Mode',s], sort=False).agg(d)
# flatten the MultiIndex in the columns
df1.columns = df1.columns.map('_'.join)
# drop the helper level g, then move the remaining index levels to columns
df1 = df1.reset_index(level=2, drop=True).reset_index()

print(df1)
   Trip_id Transportation_Mode     date_time_first      date_time_last  \
0        1                Walk 2007-05-26 10:21:00 2007-05-26 10:22:00   
1        1                Bike 2007-05-26 10:25:00 2007-05-26 10:29:00   
2        1                Walk 2007-05-26 10:31:00 2007-05-26 10:32:00   
3        1                 Car 2007-05-26 10:32:00 2007-05-26 10:34:00   
4        2                Walk 2007-05-26 10:34:00 2007-05-26 10:34:00   
5        2                 Car 2007-05-26 10:36:00 2007-05-26 10:38:00   

   Acceleration_mean  Acceleration_Top_1  Acceleration_Top_2  
0          73.244598          186.630218           20.690278  
1         795.671478         2921.070572          197.906415  
2         361.342812          500.633699          409.393916  
3         620.304746         1445.861393          298.458933  
4         818.139796         1503.180990          133.098603  
5          10.133954           12.370865            7.897043
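
The printed result has no Segid column and does not use the column names requested in the question. A minimal sketch of one way to add them, assuming df1 has the shape printed above. The small frame below is a hand-typed stand-in with only some of its columns, so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the aggregated df1 (subset of columns, values copied from the output)
df1 = pd.DataFrame({
    'Trip_id': [1, 1, 1, 1, 2, 2],
    'Transportation_Mode': ['Walk', 'Bike', 'Walk', 'Car', 'Walk', 'Car'],
    'Acceleration_mean': [73.244598, 795.671478, 361.342812,
                          620.304746, 818.139796, 10.133954],
})

# Number the segments within each trip, starting at 1, and place the
# new column right after Trip_id
df1.insert(1, 'Segid', df1.groupby('Trip_id').cumcount() + 1)

# Rename to the headers used in the question; keys that are absent
# from the frame are ignored by rename
df1 = df1.rename(columns={
    'date_time_first': 'Start_date_time',
    'date_time_last': 'End_date_time',
    'Acceleration_mean': 'Mean_Acceleration',
    'Acceleration_Top_1': 'Top_Acceleration1',
    'Acceleration_Top_2': 'Top_Acceleration2',
})

print(df1)
```

This relies on the rows of df1 still being in segment order within each trip, which holds here because the groupby was called with sort=False.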