Groupby covariance between two columns

Time: 2018-08-05 07:59:23

Tags: python-3.x pandas numpy pandas-groupby

Given a data frame df, which is a subset of my original data frame:

Transportation_Mode time_delta  trip_id segmentid   Vincenty_distance   velocity       acceleration       jerk
         walk           1          1        1          1.551676553     1.551676553     0.550163852    -1.017629555
         walk           1          1        1          1.70920675      1.70920675      0.16257622     -0.39166534
         walk           1          1        1          1.871782971     1.871782971    -0.22908912     -0.734438511
         walk          12          1        1          23.16466284     1.93038857      0.324972586    -0.331839143
         walk           1          1        1          5.830059603     5.830059603    -3.657097132     2.614438854
         bus            1         16        5          8.418372046     8.418372046    -7.259019484     7.40735053
         bus           23         16        5          26.66510892     1.159352562     0.148331046    -0.036318522
         bus            1         16        5          4.570966614     4.570966614    -0.68699497     -0.889126918

I want to compute the covariance between velocity and acceleration for each group, so that the resulting data frame df1 looks like this:

Trip_id Segmentid   Transportation_Mode  Covariance
   1        1          walk            
   16       5          bus       

I am trying to solve it this way:

grp = df.groupby(['trip_id','Transportation_Mode','segmentid'])
df1['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))      

But it raises an error:

  TypeError: incompatible index of inserted column with frame index

The detailed code is given below:

grp = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'])
df = grp.filter(lambda x: len(x)>3) # keep only the groups that have more than 3 rows

#get top1 and top2 values
f1 = lambda x: x.sort_values(ascending=False).iloc[0]
f1.__name__ = 'Top_1'
#second-largest value; the filter above leaves at least 4 rows per group, so it always exists
f2 = lambda x: x.sort_values(ascending=False).iloc[1]
f2.__name__ = 'Top_2'

f3 = lambda x: x.sort_values(ascending=False).iloc[2] 
f3.__name__ = 'Top_3'

f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'

f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'

f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'

f8 = lambda x: x.quantile(0.85)
f8.__name__ = '85_percentile'

d = {'date_time':['first','last', 'count'], 
 'acceleration':['mean', f1, f2, f3,'count', f8, 'median', 'min'], 
 'velocity':[f1, f2, f3, f5, 'sum' ,'count', f8, 'median', 'min'], 
 'velocity_rate':f6,
 'acc_rate':f7,          
 'Vincenty_distance':'sum'}

df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)

#flatten the MultiIndex in the columns
df1.columns = df1.columns.map('_'.join)
#move the MultiIndex levels of the index into columns
df1 = df1.reset_index()

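Now I want to compute the covariance between velocity and acceleration, which involves two columns, so I don't know how to do this inside the aggregation function. Or should I create a separate column for it?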

df_cv = pd.DataFrame()
df_cv['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))
df_cv = df_cv.reset_index()
df1['cov'] = df_cv['Covariance']

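When I attach the covariance column, the groups are misaligned. On row 15, the group (userid = 141, trip_id = 10, Transportation_Mode = subway, segmentid = 2) is paired with the covariance of the group (userid = 141, trip_id = 1, Transportation_Mode = walk, segmentid = 1).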

The full input data for the data frame df is available at https://drive.google.com/file/d/1JjvS7igTmrtLA4E5Rs5D6tsdAXqzpYqX/view

1 answer:

Answer 0 (score: 2)

Please check the following code:

grp = df.groupby(['trip_id','Transportation_Mode','segmentid'])
df_cv = pd.DataFrame()
df_cv['Covariance'] = grp.apply(lambda x: x['velocity'].cov(x['acceleration']))      

This gives the following data frame:

                                       Covariance
trip_id Transportation_Mode segmentid            
1       walk                1           -3.161471
16      bus                 5          -13.650859
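As a cross-check, the two values in this table can be reproduced by hand from the sample rows in the question. The sketch below assumes that Series.cov returns the sample covariance (normalized by n - 1), which is its default behaviour; the helper name sample_cov is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    "trip_id": [1, 1, 1, 1, 1, 16, 16, 16],
    "Transportation_Mode": ["walk"] * 5 + ["bus"] * 3,
    "segmentid": [1, 1, 1, 1, 1, 5, 5, 5],
    "velocity": [1.551676553, 1.70920675, 1.871782971, 1.93038857,
                 5.830059603, 8.418372046, 1.159352562, 4.570966614],
    "acceleration": [0.550163852, 0.16257622, -0.22908912, 0.324972586,
                     -3.657097132, -7.259019484, 0.148331046, -0.68699497],
})


def sample_cov(v, a):
    """Hypothetical helper: sum((v - mean(v)) * (a - mean(a))) / (n - 1)."""
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    return ((v - v.mean()) * (a - a.mean())).sum() / (len(v) - 1)


keys = ["trip_id", "Transportation_Mode", "segmentid"]

# Iterating a groupby yields (key tuple, sub-frame) pairs in sorted key order.
by_hand = {
    name: sample_cov(g["velocity"], g["acceleration"])
    for name, g in df.groupby(keys)
}

for name, value in by_hand.items():
    print(name, round(value, 6))
```

The printed values should agree with the Covariance column above to the displayed precision.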

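Note that the index of df_cv is [trip_id, Transportation_Mode, segmentid], which comes from the preceding groupby operation. The index of your original df1 is different, and that mismatch is the source of the error. You therefore need to match the indexes. For example, if df1 has a "normal" (default integer) index and holds the group keys as columns, use

df_cv = df_cv.reset_index()
# merge on the group keys so that each covariance lands on its own group's row
df1 = df1.merge(df_cv, on=['trip_id', 'Transportation_Mode', 'segmentid'], how='left')

or any other join that aligns on those keys.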
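The misalignment described in the question is consistent with positional assignment. There, df1 is built with sort=False (groups in order of first appearance), while grp uses the default sorted key order and was created before the filter step. Assigning df1['cov'] = df_cv['Covariance'] therefore pairs rows by position rather than by group. Below is a minimal sketch of key-based alignment. It uses the sample rows from the question, reordered so that the bus trip comes first (an assumption made only to make the two orderings differ), and a placeholder summary column velocity_sum standing in for the question's larger aggregation.

```python
import pandas as pd

# Sample rows from the question, with the bus trip placed first (illustrative reordering).
df = pd.DataFrame({
    "trip_id": [16, 16, 16, 1, 1, 1, 1, 1],
    "Transportation_Mode": ["bus"] * 3 + ["walk"] * 5,
    "segmentid": [5, 5, 5, 1, 1, 1, 1, 1],
    "velocity": [8.418372046, 1.159352562, 4.570966614, 1.551676553,
                 1.70920675, 1.871782971, 1.93038857, 5.830059603],
    "acceleration": [-7.259019484, 0.148331046, -0.68699497, 0.550163852,
                     0.16257622, -0.22908912, 0.324972586, -3.657097132],
})

keys = ["trip_id", "Transportation_Mode", "segmentid"]

# Per-group summary in order of first appearance, as in the question (sort=False).
df1 = (df.groupby(keys, sort=False)
         .agg(velocity_sum=("velocity", "sum"))
         .reset_index())

# Per-group covariance; groupby sorts the keys by default, so the row
# order here differs from df1.
cov = (df.groupby(keys)[["velocity", "acceleration"]]
         .apply(lambda g: g["velocity"].cov(g["acceleration"]))
         .rename("Covariance")
         .reset_index())

# Align on the key values instead of relying on row position.
df1 = df1.merge(cov, on=keys, how="left")
print(df1)
```

With how="left" the row order of df1 is preserved, and each group receives its own covariance regardless of how either groupby ordered its output. For the full pipeline in the question, userid would presumably need to be included in keys, and the covariance computed on the filtered df.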