I have a pandas DataFrame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
   group_ids       dates  event_today_in_group
0          1  2016-04-01                     1
1          1  2016-04-20                     0
2          1  2016-04-28                     1
3          2  2016-04-05                     1
4          2  2016-04-20                     1
5          2  2016-04-29                     0
I want to compute an additional column that contains, for each group_ids, the number of days since event_today_in_group was last 1.
   group_ids       dates  event_today_in_group  days_since_last_event
0          1  2016-04-01                     1                      0
1          1  2016-04-20                     0                     19
2          1  2016-04-28                     1                     27
3          2  2016-04-05                     1                      0
4          2  2016-04-20                     1                     15
5          2  2016-04-29                     0                      9
Answer 0 (score: 6)
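As I mentioned earlier, this gives you the non-cumulative difference between consecutive dates within each group (df here is the question's df12):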
df['dates'] = pd.to_datetime(df['dates'])  # diff() needs datetimes; the question's dates are strings
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().dt.days
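For reference, here is a minimal self-contained sketch of this step. It assumes the dates column holds strings, as in the question, and converts them with pd.to_datetime before taking differences; the variable name gaps is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})

# Assumption: the dates are strings, so convert them before differencing.
df['dates'] = pd.to_datetime(df['dates'])

# Gap in days to the previous row of the same group.
# The first row of each group has no predecessor, so its gap is NaN.
gaps = df.groupby('group_ids')['dates'].diff().dt.days

print(gaps.tolist())  # [nan, 19.0, 8.0, nan, 15.0, 9.0]
```

Note that row 2 shows 8 here (days since the previous row), not the 27 the question asks for (days since the previous event), which is why a cumulative step is still needed.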
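To turn these differences into a running total that restarts after each row where event_today_in_group is 1, I suggest using shift to get the previous row's value and then taking the cumulative sum, like so: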
df['event_today_in_group'].shift().cumsum()
Output:
0    NaN
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
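To see why the shift matters, the following sketch compares the shifted and unshifted cumulative sums on the same flags. The names block_id and unshifted are illustrative and not part of the original answer.

```python
import pandas as pd

events = pd.Series([1, 0, 1, 1, 1, 0], name='event_today_in_group')

# With shift(): an event opens a new block on the row AFTER it,
# so the event row itself still belongs to the block it closes.
block_id = events.shift().cumsum()

# Without shift(): the event row would already start the new block.
unshifted = events.cumsum()

print(pd.DataFrame({'event': events,
                    'block_id': block_id,
                    'unshifted': unshifted}))
```

With the shifted key, rows 1 and 2 share a block, so their gaps (19 and 8) accumulate to 27 on the event row. With the unshifted key, row 2 would start its own block and report only 8.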
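This gives us the second grouping key needed for the cumulative sum. You could assign these values to a new column, but if you only need them for the calculation, you can pass them straight into the subsequent groupby operation, like this: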
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
   group_ids       dates  event_today_in_group  days_since_last_event
0          1  2016-04-01                     1                    NaN
1          1  2016-04-20                     0                   19.0
2          1  2016-04-28                     1                   27.0
3          2  2016-04-05                     1                    NaN
4          2  2016-04-20                     1                   15.0
5          2  2016-04-29                     0                    9.0
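Putting the steps together, a self-contained sketch might look like the following. The final fillna(0) is an addition that is not part of the answer above: it assumes the first row of every group is itself an event, which holds for this sample data but not necessarily in general. The names gaps and block_id are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})
df['dates'] = pd.to_datetime(df['dates'])

# Step 1: per-group gap in days to the previous row.
gaps = df.groupby('group_ids')['dates'].diff().dt.days

# Step 2: block key that increments on the row after each event.
block_id = df['event_today_in_group'].shift().cumsum()

# Step 3: accumulate the gaps within each (group, block) pair.
df['days_since_last_event'] = gaps.groupby([df['group_ids'], block_id]).cumsum()

# Assumption: each group starts with an event, so a missing value means 0 days.
df['days_since_last_event'] = df['days_since_last_event'].fillna(0).astype(int)

print(df['days_since_last_event'].tolist())  # [0, 19, 27, 0, 15, 9]
```

If a group can start with a row that is not an event, leaving those values as NaN is probably the safer choice, since no earlier event exists to count from.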