Aggregate, group by, and unstack into many columns

Time: 2019-12-07 10:32:10

Tags: python python-3.x pandas scikit-learn pyhook

I am working in Python with a data frame of 177 columns that holds 24 hours of patient measurements, as shown below:

subject_id hour_measure         urinecolor   Respiraory                 
3          1.00                 red          40
3          1.15                 red          90
4          2.00              yellow          60

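I want to compute some statistics for each hour, such as mean, max, std, skew, and so on.

Because the data frame contains both text and numeric columns, I cannot loop over the whole data frame to aggregate it, so I tried aggregating each column separately, as in the code below: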

grouped = df.groupby(['Hour_measure', 'subject_id']).agg({"Heart Rate": ['sum', 'min', 'max', 'std', 'count', 'var', 'skew']})
grouped2 = df.groupby(['Hour_measure', 'subject_id']).agg({"Respiraory": ['sum', 'min', 'max', 'std', 'count']})

# flatten the two-level column names, then write the aggregated values to a csv file
grouped.columns = ["_".join(x) for x in grouped.columns]
grouped.to_csv('temp3.csv')

# append the second aggregation to the same file
with open('temp3.csv', 'a') as f:
    grouped2.to_csv(f, header=True)

# unstack to move Hour_measure from the row index into the columns
df.set_index(['subject_id', 'Hour_measure']).unstack()

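This code works, but I would like to aggregate every numeric column in a loop. For each text column, I want to pick the most frequent value within the hour instead of applying statistical functions, add it to the final file, and then unstack the result by subject_id and hour_measure, so that I end up with: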

                         heart rate
                  1                          2              3 ... to 24, then the next feature
subject_id   min   max   std   skew     min   max   std
1            40    110   50    60       60    290   40

1 answer:

Answer 0 (score: 0)

Use:

print (df)
   hour  subject_id  hour_measure urinecolor  Respiraory
0     1           3          1.00        red          40
1     1           3          1.15        red          90
2     1           4          2.00     yellow          60

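keys = ['hour_measure','subject_id', 'hour']
num_cols = df.select_dtypes('number').columns.difference(keys)  # numeric, non-key columns only
df1 = (df.groupby(keys)[num_cols]
        .agg(['sum','min','max','std', 'count','var','skew']))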
print (df1)
                             Respiraory                           
                                    sum min max std count var skew
hour_measure subject_id hour                                      
1.00         3          1            40  40  40 NaN     1 NaN  NaN
1.15         3          1            90  90  90 NaN     1 NaN  NaN
2.00         4          1            60  60  60 NaN     1 NaN  NaN

f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['hour_measure','subject_id', 'hour'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print (df2)
                             urinecolor
                                   mode
hour_measure subject_id hour           
1.00         3          1           red
1.15         3          1           red
2.00         4          1        yellow
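
The answer stops before joining the two results and pivoting the hours into columns. One possible way to finish, shown here only as a minimal sketch and not as the answerer's method: concatenate the numeric statistics and the text modes side by side, then unstack the hour level. The sample rows, the `hour` column derived from the integer part of `hour_measure`, and the names `num_stats`, `txt_stats`, `wide` and `flat` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "subject_id":   [3, 3, 3, 4, 4],
    "hour_measure": [1.00, 1.15, 2.30, 1.20, 2.00],
    "urinecolor":   ["red", "red", "yellow", "yellow", "yellow"],
    "Respiraory":   [40, 90, 70, 60, 80],
})

# Assumption: the integer part of hour_measure is the hour bucket.
df["hour"] = df["hour_measure"].astype(int)

keys = ["subject_id", "hour"]
skip = keys + ["hour_measure"]
num_cols = df.select_dtypes(include="number").columns.difference(skip).tolist()
txt_cols = df.select_dtypes(exclude="number").columns.difference(skip).tolist()

# Numeric columns: several statistics per (subject_id, hour).
num_stats = df.groupby(keys)[num_cols].agg(["min", "max", "mean", "std"])

# Text columns: most frequent value per group (None if the group has no values).
most_common = lambda s: next(iter(s.mode()), None)
txt_stats = df.groupby(keys)[txt_cols].agg(most_common)
txt_stats.columns = pd.MultiIndex.from_product([txt_stats.columns, ["mode"]])

# Join side by side, then move the hour level from the rows into the columns.
wide = pd.concat([num_stats, txt_stats], axis=1).unstack("hour")

# Reorder the column levels to feature / hour / statistic and sort them.
wide = wide.reorder_levels([0, 2, 1], axis=1).sort_index(axis=1)

# Flatten to single-level names such as "Respiraory_1_max" for CSV export.
flat = wide.copy()
flat.columns = [f"{feat}_{hour}_{stat}" for feat, hour, stat in flat.columns]
print(flat)
```

Here `wide` has one row per subject_id and a three-level column index (feature, hour, statistic), which matches the layout sketched in the question; `flat.to_csv(...)` would then write everything to a single file instead of appending two separate tables.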