Given a dataframe df:
Trip_id Latitude Longitude Acceleration date_time Transportation_Mode
1 39.98528333 116.3073667 186.6302183 5/26/2007 10:21 Walk
1 39.98521667 116.30955 20.69027793 5/26/2007 10:22 Walk
1 39.98513333 116.3097667 12.41329907 5/26/2007 10:22 Walk
1 39.9845 116.31 35.69170853 5/26/2007 10:25 Bike
1 39.98423333 116.3102333 28.01721471 5/26/2007 10:25 Bike
1 39.98403333 116.3104333 2921.070572 5/26/2007 10:25 Bike
1 39.98518333 116.3446 197.9064152 5/26/2007 10:29 Bike
1 39.96858333 116.3471167 409.3939156 5/26/2007 10:31 Walk
1 39.9649 116.3473333 174.0008214 5/26/2007 10:31 Walk
1 39.96335 116.3470333 500.6336985 5/26/2007 10:32 Walk
1 39.95885 116.3474 298.458933 5/26/2007 10:32 Car
1 39.95635 116.3486833 1445.861393 5/26/2007 10:32 Car
1 39.94336667 116.3499833 116.5939123 5/26/2007 10:34 Car
2 39.94231667 116.3499667 133.0986026 5/26/2007 10:34 Walk
2 39.94123333 116.3493 1503.18099 5/26/2007 10:34 Walk
2 39.9277 116.3497667 12.37086539 5/26/2007 10:36 Car
2 39.91055 116.35045 7.897042746 5/26/2007 10:38 Car
I want to obtain the final dataframe df1:
Trip_id Segid Transportation_Mode Start_date_time End_date_time Mean_Acceleration Top_Acceleration1 Top_Acceleration2
1 1 Walk 5/26/2007 10:21 5/26/2007 10:22 73.24459843 186.6302183 20.69027793
1 2 Bike 5/26/2007 10:25 5/26/2007 10:29 795.6714775 2921.070572 197.9064152
1 3 Walk 5/26/2007 10:31 5/26/2007 10:32 361.3428118 500.6336985 409.3939156
1 4 Car 5/26/2007 10:32 5/26/2007 10:34 620.3047461 1445.861393 298.458933
2 1 Walk 5/26/2007 10:34 5/26/2007 10:34 818.1397964 1503.18099 133.0986026
2 2 Car 5/26/2007 10:36 5/26/2007 10:38 10.13395407 12.37086539 7.897042746
i) Group the dataframe so that consecutive rows with the same Transportation_Mode form one group/segment.
ii) df1 is the list of segments; for each segment it contains the start_date_time, the end_date_time, the mean acceleration, and the top 2 accelerations.
iii) A trip consists of multiple segments, and each segment has a single transportation mode.
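For reference, a minimal sketch that rebuilds the first few rows of the sample data above as a DataFrame (column names and values are taken directly from the table shown; this is only a small reproducible subset):

import pandas as pd

# first six rows of the sample data, enough to exercise the segment grouping
df = pd.DataFrame({
    'Trip_id': [1, 1, 1, 1, 1, 1],
    'Latitude': [39.98528333, 39.98521667, 39.98513333, 39.9845, 39.98423333, 39.98403333],
    'Longitude': [116.3073667, 116.30955, 116.3097667, 116.31, 116.3102333, 116.3104333],
    'Acceleration': [186.6302183, 20.69027793, 12.41329907, 35.69170853, 28.01721471, 2921.070572],
    'date_time': ['5/26/2007 10:21', '5/26/2007 10:22', '5/26/2007 10:22',
                  '5/26/2007 10:25', '5/26/2007 10:25', '5/26/2007 10:25'],
    'Transportation_Mode': ['Walk', 'Walk', 'Walk', 'Bike', 'Bike', 'Bike'],
})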
Answer 0 (score: 1)
I think you need DataFrameGroupBy.agg with custom functions; note that each group needs at least 2 values to get the top 2 accelerations:
import pandas as pd

#convert column to datetimes
df['date_time'] = pd.to_datetime(df['date_time'])

#helper Series: increases by 1 each time Transportation_Mode changes,
#so equal values mark one consecutive segment
s = df['Transportation_Mode'].ne(df['Transportation_Mode'].shift()).cumsum().rename('g')

#keep only rows whose segment has at least 2 rows
#(the Top_2 aggregation needs a second value per group)
df = df[s.duplicated(keep=False)]

#custom aggregations for the top 1 and top 2 acceleration values
f1 = lambda x: x.sort_values(ascending=False).iloc[0]
f1.__name__ = 'Top_1'
f2 = lambda x: x.sort_values(ascending=False).iloc[1]
f2.__name__ = 'Top_2'

d = {'date_time':['first','last'], 'Acceleration':['mean', f1, f2]}
df1 = df.groupby(['Trip_id','Transportation_Mode', s], sort=False).agg(d)

#flatten the MultiIndex in columns
df1.columns = df1.columns.map('_'.join)

#drop the helper level and move the remaining index levels back to columns
df1 = df1.reset_index(level=2, drop=True).reset_index()
print (df1)
Trip_id Transportation_Mode date_time_first date_time_last \
0 1 Walk 2007-05-26 10:21:00 2007-05-26 10:22:00
1 1 Bike 2007-05-26 10:25:00 2007-05-26 10:29:00
2 1 Walk 2007-05-26 10:31:00 2007-05-26 10:32:00
3 1 Car 2007-05-26 10:32:00 2007-05-26 10:34:00
4 2 Walk 2007-05-26 10:34:00 2007-05-26 10:34:00
5 2 Car 2007-05-26 10:36:00 2007-05-26 10:38:00
Acceleration_mean Acceleration_Top_1 Acceleration_Top_2
0 73.244598 186.630218 20.690278
1 795.671478 2921.070572 197.906415
2 361.342812 500.633699 409.393916
3 620.304746 1445.861393 298.458933
4 818.139796 1503.180990 133.098603
5 10.133954 12.370865 7.897043
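Two optional tweaks, sketched under assumptions (pandas >= 0.25 for named aggregation): adding the Segid column from the desired output, and a variant that keeps single-row segments instead of dropping them, by applying it to the unfiltered df (i.e., skipping the s.duplicated filter step) and returning NaN for Top_Acceleration2 when a segment has only one row.

#number segments within each trip to get Segid (rows are already in order
#of appearance because sort=False was used above)
df1['Segid'] = df1.groupby('Trip_id').cumcount() + 1

#variant on the unfiltered df that does not drop single-row segments;
#Top_Acceleration2 becomes NaN when a segment has fewer than 2 rows
df1_all = (df.groupby(['Trip_id', s], sort=False)
             .agg(Transportation_Mode=('Transportation_Mode', 'first'),
                  Start_date_time=('date_time', 'first'),
                  End_date_time=('date_time', 'last'),
                  Mean_Acceleration=('Acceleration', 'mean'),
                  Top_Acceleration1=('Acceleration', 'max'),
                  Top_Acceleration2=('Acceleration',
                                     lambda x: x.nlargest(2).iloc[-1] if len(x) > 1 else float('nan')))
             .reset_index(level=1, drop=True)
             .reset_index())
df1_all['Segid'] = df1_all.groupby('Trip_id').cumcount() + 1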