I'm working in Python with a DataFrame of 177 columns that holds 24 hours of patient values, as shown below:
subject_id  hour_measure  urinecolor  Respiraory
3           1.00          red         40
3           1.15          red         90
4           2.00          yellow      60
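I want to compute some statistics for each hour, such as mean, max, std, skew, and so on.
Because the frame contains both text and numeric columns, I cannot aggregate the whole DataFrame in one pass, so I tried aggregating each column separately, as in the code below: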
grouped = df.groupby(['hour_measure', 'subject_id']).agg({"Heart Rate": ['sum', 'min', 'max', 'std', 'count', 'var', 'skew']})
grouped2 = df.groupby(['hour_measure', 'subject_id']).agg({"Respiraory": ['sum', 'min', 'max', 'std', 'count']})
# write the aggregated values to a CSV file
grouped.columns = ["_".join(x) for x in grouped.columns.to_flat_index()]
grouped.to_csv('temp3.csv')
with open('temp3.csv', 'a') as f:
    grouped2.to_csv(f, header=True)
# unstack so that hour_measure moves from the index into the columns
df.set_index(['subject_id', 'hour_measure']).unstack()
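This code works, but I would like to use a loop to aggregate every numeric column. For each text column, I want to pick the most frequent value within the hour instead of applying a statistical function, and add it to the final file, which will be stacked by subject_id and hour_measure. In the end I want something like this: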
heart rate
1 2 3.... to 24 then the next feature
subject_id min max std skew min max std
1 40 110 50 60 60 290 40
Answer 0 (score: 0)
Use:
import pandas as pd

print(df)
hour subject_id hour_measure urinecolor Respiraory
0 1 3 1.00 red 40
1 1 3 1.15 red 90
2 1 4 2.00 yellow 60
df1 = (df.groupby(['hour_measure', 'subject_id', 'hour'])
         .agg(['sum', 'min', 'max', 'std', 'count', 'var', 'skew']))
print (df1)
Respiraory
sum min max std count var skew
hour_measure subject_id hour
1.00 3 1 40 40 40 NaN 1 NaN NaN
1.15 3 1 90 90 90 NaN 1 NaN NaN
2.00 4 1 60 60 60 NaN 1 NaN NaN
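Depending on the pandas version, passing text columns such as urinecolor to numeric aggregations like 'std' or 'skew' may raise an error rather than silently dropping them. One way around this is to select the numeric columns explicitly before aggregating. A minimal sketch, assuming the sample frame printed above; the names keys and num_cols are introduced here for illustration only:

```python
import pandas as pd

# Sample data copied from the printed frame above.
df = pd.DataFrame({
    "hour": [1, 1, 1],
    "subject_id": [3, 3, 4],
    "hour_measure": [1.00, 1.15, 2.00],
    "urinecolor": ["red", "red", "yellow"],
    "Respiraory": [40, 90, 60],
})

# Grouping keys (illustrative name); these are numeric too, so exclude them.
keys = ["hour_measure", "subject_id", "hour"]
num_cols = df.select_dtypes("number").columns.difference(keys)

# Aggregate only the numeric measurement columns.
df1 = df.groupby(keys)[list(num_cols)].agg(
    ["sum", "min", "max", "std", "count", "var", "skew"]
)
print(df1)
```

With 177 columns this avoids listing each numeric column by hand, since select_dtypes picks them up from their dtype.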
# most frequent value in the group, or None if the group has no values
f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['hour_measure','subject_id', 'hour'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print(df2)
urinecolor
mode
hour_measure subject_id hour
1.00 3 1 red
1.15 3 1 red
2.00 4 1 yellow
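The numeric statistics and the text modes can then be joined and reshaped toward the wide layout asked for in the question. The following is only a sketch under assumptions: it groups by subject_id and hour (dropping hour_measure from the keys so that each group covers a whole hour), and the names keys, stats, modes, combined and wide are hypothetical.

```python
import pandas as pd

# Sample data copied from the printed frame above.
df = pd.DataFrame({
    "hour": [1, 1, 1],
    "subject_id": [3, 3, 4],
    "hour_measure": [1.00, 1.15, 2.00],
    "urinecolor": ["red", "red", "yellow"],
    "Respiraory": [40, 90, 60],
})

# Assumption: one group per patient per hour.
keys = ["subject_id", "hour"]
num_cols = df.select_dtypes("number").columns.difference(keys + ["hour_measure"])
txt_cols = df.select_dtypes(exclude="number").columns

# Statistics for the numeric columns.
stats = df.groupby(keys)[list(num_cols)].agg(["min", "max", "std", "skew"])

# Most frequent value for the text columns.
most_common = lambda s: next(iter(s.mode()), None)
modes = df.groupby(keys)[list(txt_cols)].agg(most_common)
modes.columns = pd.MultiIndex.from_product([modes.columns, ["mode"]])

# Join side by side, then move the hour level into the columns.
combined = pd.concat([stats, modes], axis=1)
wide = combined.unstack("hour")

# Flatten (feature, statistic, hour) into names like "Respiraory_max_1".
wide.columns = ["_".join(map(str, c)) for c in wide.columns.to_flat_index()]
print(wide)
# wide.to_csv("summary.csv") would then write one row per subject_id.
```

The result has one row per subject_id and one column per feature, statistic and hour, so a single to_csv call replaces appending several grouped frames to the same file.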