Given a DataFrame df, which is a subset of my original DataFrame:
Transportation_Mode time_delta trip_id segmentid Vincenty_distance velocity acceleration jerk
walk 1 1 1 1.551676553 1.551676553 0.550163852 -1.017629555
walk 1 1 1 1.70920675 1.70920675 0.16257622 -0.39166534
walk 1 1 1 1.871782971 1.871782971 -0.22908912 -0.734438511
walk 12 1 1 23.16466284 1.93038857 0.324972586 -0.331839143
walk 1 1 1 5.830059603 5.830059603 -3.657097132 2.614438854
bus 1 16 5 8.418372046 8.418372046 -7.259019484 7.40735053
bus 23 16 5 26.66510892 1.159352562 0.148331046 -0.036318522
bus 1 16 5 4.570966614 4.570966614 -0.68699497 -0.889126918
I want to compute the covariance between velocity and acceleration for each group, so that the resulting DataFrame df1 looks like this:
Trip_id Segmentid Transportation_Mode Covariance
1 1 walk
16 5 bus
I am trying to solve it this way:
grp = df.groupby(['trip_id','Transportation_Mode','segmentid'])
df1['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))
But it raises an error:
TypeError: incompatible index of inserted column with frame index
The detailed code is given below:
grp = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'])
df = grp.filter(lambda x: len(x) > 3)  # keep only the groups that have more than 3 rows
# get the top-1, top-2 and top-3 values
f1 = lambda x: x.sort_values(ascending=False).iloc[0]
f1.__name__ = 'Top_1'
# second-largest value (every group has more than 3 rows after the filter above)
f2 = lambda x: x.sort_values(ascending=False).iloc[1]
f2.__name__ = 'Top_2'
f3 = lambda x: x.sort_values(ascending=False).iloc[2]
f3.__name__ = 'Top_3'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
f8 = lambda x: x.quantile(0.85)
f8.__name__ = '85_percentile'
d = {'date_time': ['first', 'last', 'count'],
     'acceleration': ['mean', f1, f2, f3, 'count', f8, 'median', 'min'],
     'velocity': [f1, f2, f3, f5, 'sum', 'count', f8, 'median', 'min'],
     'velocity_rate': f6,
     'acc_rate': f7,
     'Vincenty_distance': 'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
# flatten the MultiIndex in the columns
df1.columns = df1.columns.map('_'.join)
# move the MultiIndex levels from the index into columns
df1 = df1.reset_index()
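Now I want to compute the covariance between velocity and acceleration, which involves two columns. I don't know how to do this inside the aggregation function, or how to create a separate column for it.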
df_cv = pd.DataFrame()
df_cv['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))
df_cv = df_cv.reset_index()
df1['cov'] = df_cv['Covariance']
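When I attach the covariance column, the groups are not aligned. At row 15, the group (userid = 141, trip_id = 10, Transportation_Mode = subway, segmentid = 2) is associated with the covariance of the group (userid = 141, trip_id = 1, Transportation_Mode = walk, segmentid = 1).
The full input data for the DataFrame df is available at this link: https://drive.google.com/file/d/1JjvS7igTmrtLA4E5Rs5D6tsdAXqzpYqX/view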
Answer 0 (score: 2)
Please check the following code:
grp = df.groupby(['trip_id','Transportation_Mode','segmentid'])
df_cv = pd.DataFrame()
df_cv['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))
This yields the following DataFrame:
Covariance
trip_id Transportation_Mode segmentid
1 walk 1 -3.161471
16 bus 5 -13.650859
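As a rough check, the result above can be reproduced from the sample rows in the question. The sketch below rebuilds only the columns involved (the variable name `cov` is an illustrative choice, not part of the original code) and prints the index names, which makes the group-key MultiIndex visible:

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    "trip_id": [1, 1, 1, 1, 1, 16, 16, 16],
    "Transportation_Mode": ["walk"] * 5 + ["bus"] * 3,
    "segmentid": [1, 1, 1, 1, 1, 5, 5, 5],
    "velocity": [1.551676553, 1.70920675, 1.871782971, 1.93038857,
                 5.830059603, 8.418372046, 1.159352562, 4.570966614],
    "acceleration": [0.550163852, 0.16257622, -0.22908912, 0.324972586,
                     -3.657097132, -7.259019484, 0.148331046, -0.68699497],
})

grp = df.groupby(["trip_id", "Transportation_Mode", "segmentid"])

# Selecting the two columns first keeps the grouping keys out of the lambda's input.
cov = grp[["velocity", "acceleration"]].apply(
    lambda g: g["velocity"].cov(g["acceleration"])
)

print(cov.index.names)  # the group keys, not a plain integer index
print(cov.round(6))
```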
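Note that the index of df_cv is [trip_id, Transportation_Mode, segmentid], which comes from the preceding groupby operation. In your original df1 the index is different, and that is the source of the error. You therefore need to make the indexes match. For example, if df1 has a "normal" index, you can do so with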
df_cv = df_cv.reset_index()
df1 = pd.concat([df1, df_cv])  # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
or through some kind of merge operation.
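One way the merge route might look is sketched below, using the sample rows from the question. The `keys` list, the `velocity_sum` column, and the small stand-in for df1 are illustrative assumptions, not the asker's full aggregation. The point is that joining on the group-key columns attaches each covariance to its own group regardless of row order:

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    "trip_id": [1, 1, 1, 1, 1, 16, 16, 16],
    "Transportation_Mode": ["walk"] * 5 + ["bus"] * 3,
    "segmentid": [1, 1, 1, 1, 1, 5, 5, 5],
    "velocity": [1.551676553, 1.70920675, 1.871782971, 1.93038857,
                 5.830059603, 8.418372046, 1.159352562, 4.570966614],
    "acceleration": [0.550163852, 0.16257622, -0.22908912, 0.324972586,
                     -3.657097132, -7.259019484, 0.148331046, -0.68699497],
})

keys = ["trip_id", "Transportation_Mode", "segmentid"]

# Simplified stand-in for the aggregated frame df1 (one row per group).
df1 = (df.groupby(keys, sort=False)
         .agg(velocity_sum=("velocity", "sum"))
         .reset_index())

# Per-group covariance, turned into a frame with the keys as ordinary columns.
cov = (df.groupby(keys)[["velocity", "acceleration"]]
         .apply(lambda g: g["velocity"].cov(g["acceleration"]))
         .rename("Covariance")
         .reset_index())

# Join on the key columns instead of relying on row position.
df1 = df1.merge(cov, on=keys, how="left")
print(df1)
```

This also seems to explain the misalignment in the detailed code: there, grp is created before the filter and with the default sort=True, while df1 is built after the filter with sort=False, so the two results can differ both in which groups they contain and in row order. Assigning df_cv['Covariance'] after reset_index then pairs rows by position rather than by group.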