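I am trying to create an event counter and a days-since-first-event counter for some log data. The DataFrame below tracks, for each group, whether an event occurred on each day. For each group, I need to count the number of events that have occurred up to and including each date. I also need the number of days since the first event occurred in each group.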
Starting DF
group date event
A 2020-07-16 0
A 2020-07-17 1
A 2020-07-18 0
A 2020-07-19 1
A 2020-07-20 0
A 2020-07-21 0
A 2020-07-22 1
B 2020-07-16 1
B 2020-07-17 1
B 2020-07-18 0
B 2020-07-19 1
B 2020-07-20 0
B 2020-07-21 1
B 2020-07-22 1
Code to generate the DF
import datetime
import pandas as pd

# Build the last 7 calendar dates, newest first
base = datetime.datetime.today()
numdays = 7
date_list = [(base - datetime.timedelta(days=x)).date() for x in range(numdays)]

# One block of rows per group; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
frames = [pd.DataFrame({'group': group, 'date': date_list}) for group in ['A', 'B']]
df = pd.concat(frames)
df = df.sort_values(['group', 'date'])

# Event flags in the sorted (group, date) order
groupA_events = [0, 1, 0, 1, 0, 0, 1]
groupB_events = [1, 1, 0, 1, 0, 1, 1]
events = groupA_events + groupB_events
df['event'] = events
My data has about 800,000 rows (and is still growing). I found a solution that works, but it takes a very long time to run.
Answer 0 (score: 2)
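You can use groupby + cumsum for the counter, then cumcount for the time since the first event. Note that cumcount counts rows, so it equals the number of days only when each group has exactly one row per consecutive day.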
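df = df.reset_index(drop=True)  # the masked assignment below aligns on the index, so it must be unique
df['counter'] = df.groupby('group')['event'].cumsum()  # running event total per group
# Number the rows from each group's first event onward; earlier rows get NaN
df['since_first'] = df[df['counter'].ne(0)].groupby('group')['counter'].cumcount()
df['since_first'] = df['since_first'].fillna(0).astype(int)  # rows before the first event become 0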
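As a quick check, the approach in this answer can be run end to end on the sample data from the question. This is only a minimal sketch: the dates are hard-coded to match the table rather than derived from today's date, the helper name `add_counters` is made up for illustration, and the index is reset first on the assumption that it may contain duplicates (as it does when the frame is built by stacking one frame per group).

```python
import pandas as pd

def add_counters(df):
    """Add 'counter' and 'since_first' columns (hypothetical helper name)."""
    out = df.reset_index(drop=True)  # unique index so the masked assignment aligns
    # Running total of events within each group
    out['counter'] = out.groupby('group')['event'].cumsum()
    # Rows at or after the first event of their group
    started = out[out['counter'].ne(0)]
    # Number those rows 0, 1, 2, ... per group; rows before the first event become NaN
    out['since_first'] = started.groupby('group')['counter'].cumcount()
    out['since_first'] = out['since_first'].fillna(0).astype(int)
    return out

# Sample data copied from the question's table
dates = pd.date_range('2020-07-16', periods=7, freq='D')
sample = pd.DataFrame({
    'group': ['A'] * 7 + ['B'] * 7,
    'date': list(dates) * 2,
    'event': [0, 1, 0, 1, 0, 0, 1] + [1, 1, 0, 1, 0, 1, 1],
})
result = add_counters(sample)
print(result)
```

Because every step is a vectorised groupby operation rather than a Python loop over rows, this style is usually a reasonable fit for frames with hundreds of thousands of rows.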
Answer 1 (score: 2)
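Use cumsum to get the counter. The days since the first event can be obtained by masking the dates to those that have an event and transforming to the first such date within each group. This is useful because it still calculates the time difference correctly even if your dates are not consecutive. (clip the result at 0 so that everything before the first event counts as 0.)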
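# Running total of events within each group
df['counter'] = df.groupby('group')['event'].cumsum()
df['date'] = pd.to_datetime(df['date'])
# Keep only the dates that have an event, then broadcast each group's first such date to all its rows
s_first = df['date'].where(df['event'].eq(1)).groupby(df['group']).transform('first')
# Days elapsed since that first event; rows before it would be negative, so clip them to 0
df['days_since'] = (df['date'] - s_first).dt.days.clip(lower=0)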
group date event counter days_since
6 A 2020-07-16 0 0 0
5 A 2020-07-17 1 1 0
4 A 2020-07-18 0 1 1
3 A 2020-07-19 1 2 2
2 A 2020-07-20 0 2 3
1 A 2020-07-21 0 2 4
0 A 2020-07-22 1 3 5
6 B 2020-07-16 1 1 0
5 B 2020-07-17 1 2 1
4 B 2020-07-18 0 2 2
3 B 2020-07-19 1 3 3
2 B 2020-07-20 0 3 4
1 B 2020-07-21 1 4 5
0 B 2020-07-22 1 5 6
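The difference between counting rows and subtracting dates only shows up when days are missing. The sketch below uses made-up data for a single group with a two-day gap; the frame name `gapped` and the column name `rows_since` are chosen here purely for illustration and do not come from either answer.

```python
import pandas as pd

# Made-up data: one group with no rows for 2020-07-18 and 2020-07-19
gapped = pd.DataFrame({
    'group': ['A'] * 4,
    'date': pd.to_datetime(['2020-07-16', '2020-07-17', '2020-07-20', '2020-07-21']),
    'event': [0, 1, 0, 1],
})

# Row-based count: position of each row relative to the group's first event row
has_started = gapped.groupby('group')['event'].cumsum().gt(0).astype(int)
gapped['rows_since'] = (has_started.groupby(gapped['group']).cumsum() - 1).clip(lower=0)

# Date-based count: calendar days elapsed since the group's first event date
first_event = (
    gapped['date']
    .where(gapped['event'].eq(1))
    .groupby(gapped['group'])
    .transform('first')
)
gapped['days_since'] = (gapped['date'] - first_event).dt.days.clip(lower=0)

print(gapped)
```

In this example the last row is two rows after the first event but four calendar days after it, so the two columns disagree; on data with exactly one row per consecutive day they would be identical.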