I have data that looks like this:
subject_id hour_measure urine color heart_rate
3 1 red 40
3 1.15 red 60
4 2 yellow 50
I want to reindex the data so that every patient has 24 hourly measurements.
I am using the following code:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
df = df.groupby(['subject_id','hour_measure']).mean().reindex(mux).reset_index()
df.to_csv('totalafterreindex.csv')
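This works well for the numeric values, but the categorical values are dropped. How can I enhance this code so that it uses the mean for numeric columns and the most frequent value for categorical columns?
Desired output: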
subject_id hour_measure urine color heart_rate
3 1 red 40
3 2 red 60
3 3 yellow 50
3 4 yellow 50
.. .. ..
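Answer 0 (score: 0)
Use GroupBy.agg with mean for numeric columns and mode for categorical columns, together with next and iter so that None is returned if mode returns an empty result: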
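import numpy as np
import pandas as pd
# np.arange excludes its stop value, so (1, 25) covers hours 1..24
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1, 25)],
                                 names=['subject_id','hour_measure'])
# mean for numeric columns, first mode (or None if there is none) otherwise
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id','hour_measure']).agg(f).reindex(mux).reset_index()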
Details:
print (df.groupby(['subject_id','hour_measure']).agg(f))
                        urine color  heart_rate
subject_id hour_measure
3          1.00                 red          40
           1.15                 red          60
4          2.00              yellow          50
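To make the next(iter(x.mode()), None) idiom easier to follow, here is a small standalone sketch. The helper name most_frequent is hypothetical and only used for illustration; it relies on Series.mode dropping NaN by default and returning its values sorted.

```python
import pandas as pd


def most_frequent(values: pd.Series):
    """Return the first mode of a Series, or None if it has no non-null values."""
    # Series.mode() can return several values (ties) or an empty Series,
    # so next(iter(...), None) takes the first one and falls back to None.
    return next(iter(values.mode()), None)


print(most_frequent(pd.Series(["red", "yellow", "red"])))    # red
print(most_frequent(pd.Series(["red", "yellow"])))           # red (tie, modes are sorted)
print(most_frequent(pd.Series([None, None], dtype=object)))  # None
```

Without the fallback, a group that holds only missing values would hand agg an empty result instead of a single scalar.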
Finally, use GroupBy.ffill per subject_id to forward-fill the missing values:
# fill the reindexed frame (df1); the original df has no gaps to fill
cols = df1.columns.difference(['subject_id','hour_measure'])
df1[cols] = df1.groupby('subject_id')[cols].ffill()
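Putting the steps together, a minimal end-to-end sketch could look like the following. The sample rows are made up for illustration (whole-number hours, with a duplicate hour so the aggregation has something to do), the hour grid is assumed to be 1..24, and pd.api.types.is_numeric_dtype is swapped in for np.issubdtype as a matter of taste. The names aggregate and out are hypothetical as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: subject 3 has two readings in hour 1.
df = pd.DataFrame({
    "subject_id": [3, 3, 3, 4],
    "hour_measure": [1, 1, 3, 2],
    "urine color": ["red", "red", "yellow", "yellow"],
    "heart_rate": [40, 60, 50, 50],
})

# Assumed grid: every subject crossed with hours 1..24 (stop value is exclusive).
mux = pd.MultiIndex.from_product(
    [df["subject_id"].unique(), np.arange(1, 25)],
    names=["subject_id", "hour_measure"],
)


def aggregate(col: pd.Series):
    """Mean for numeric columns, most frequent value (or None) otherwise."""
    if pd.api.types.is_numeric_dtype(col):
        return col.mean()
    return next(iter(col.mode()), None)


out = (
    df.groupby(["subject_id", "hour_measure"])
      .agg(aggregate)
      .reindex(mux)      # adds the missing hours as rows of NaN
      .reset_index()
)

# Forward-fill within each subject so values never leak across patients.
cols = out.columns.difference(["subject_id", "hour_measure"]).tolist()
out[cols] = out.groupby("subject_id")[cols].ffill()

print(out.head(5))
```

Note that ffill only fills forward, so hours before a subject's first measurement (hour 1 for subject 4 here) stay empty. Hours that are not on the grid, such as 1.15 in the question's data, would be dropped by reindex unless they are first mapped to a whole hour.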