Here is a sample of the source data:
ID Date Duration
111 2020-01-01 00:42:23
111 2020-01-01 00:23:23
111 2020-01-02 00:37:22
222 2020-01-02 00:13:08
222 2020-01-03 01:52:11
....
999 2020-01-31 00:15:21
999 2020-01-31 00:52:12
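I am using Pandas. I want to sum the duration per day for each ID, and then count, for each ID, the number of days per month on which that daily total exceeds 30 minutes.
This is what I need to get: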
ID Total days when sum of duration by day from each ID > 30 min (per month)
111 2
222 1
....
999 5
I tried something like this:
import datetime as dt
import pandas as pd

aggregation = {
    'num_days': pd.NamedAgg(column="Duration", aggfunc=lambda x: x.sum() > dt.timedelta(minutes=30)),
}
total_active = df.groupby('ID').agg(**aggregation)
But this is not at all what I need...
Can anyone help?
Answer 0 (score: 0)
Try this:
# Convert each duration to minutes.
df['_duration'] = pd.to_timedelta(df['Duration']).dt.total_seconds() / 60
# Sum the minutes per ID and day.
df_g = df.groupby(['ID', 'Date'])['_duration'].sum().reset_index()
# Keep only the days whose total exceeds 30 minutes.
df_g = df_g[df_g['_duration'] > 30]
Answer 1 (score: 0)
print(df)
ID Date Duration
0 111 2020-01-01 00:42:23
1 111 2020-01-01 00:23:23
2 111 2020-01-02 00:37:22
3 222 2020-01-02 00:13:08
4 222 2020-01-03 01:52:11
5 999 2020-01-31 00:15:21
6 999 2020-01-31 00:52:12
Use pd.Timedelta to convert the dtype of the Duration column to <m8[ns]:
df['Duration'] = df.Duration.apply(pd.Timedelta)
Then use groupby and sum:
result = (df.groupby(['ID', "Date"])['Duration'].sum() > "30min").groupby("ID").sum()
Output:
ID
111 2.0
222 1.0
999 1.0
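The result above counts qualifying days across the whole dataset. Since the question asks for a count per month, one possible extension is sketched below. It rebuilds only the sample rows shown in the question (the elided rows are not reproduced), and the names `active`, `Month`, and `days_per_month` are hypothetical, chosen for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the elided rows are not reproduced.
df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 999, 999],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-31", "2020-01-31"],
    "Duration": ["00:42:23", "00:23:23", "00:37:22", "00:13:08",
                 "01:52:11", "00:15:21", "00:52:12"],
})
df["Date"] = pd.to_datetime(df["Date"])
df["Duration"] = pd.to_timedelta(df["Duration"])

# Step 1: total duration per ID per calendar day.
daily = df.groupby(["ID", "Date"])["Duration"].sum()

# Step 2: flag the days whose total exceeds 30 minutes.
active = (daily > pd.Timedelta(minutes=30)).rename("active").reset_index()

# Step 3: count flagged days per ID and month ("Month" is a hypothetical column name).
active["Month"] = active["Date"].dt.to_period("M")
days_per_month = active.groupby(["ID", "Month"])["active"].sum().astype(int)
print(days_per_month)
```

With data spanning several months, each ID would get one row per month rather than a single total.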
Answer 2 (score: 0)
I am not sure whether you want a sum or a count, but this matches your expected output.
df['Date'] = pd.to_datetime(df['Date'])  # Coerce Date to datetime
df['Duration'] = pd.to_timedelta(df['Duration'])  # Coerce Duration to timedelta
df.set_index(df['Date'], inplace=True)  # Set Date as the index
# Group by date and ID, test the condition, then sum per ID.
(df.groupby([df.index.date, df.ID])['Duration'].sum() > '30min').groupby('ID').sum()
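For readers who prefer the pd.NamedAgg style used in the question, the same two-step idea might be written as below. This is only a sketch on the sample rows from the question; the names `day_total`, `over_30min`, and `num_days` are illustrative assumptions. The key point is that the daily sum has to be computed first, because a single aggregation per ID cannot see the per-day totals.

```python
import pandas as pd

# Sample rows copied from the question; the elided rows are not reproduced.
df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 999, 999],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-31", "2020-01-31"],
    "Duration": ["00:42:23", "00:23:23", "00:37:22", "00:13:08",
                 "01:52:11", "00:15:21", "00:52:12"],
})
df["Duration"] = pd.to_timedelta(df["Duration"])

# First aggregation: one row per (ID, Date) holding that day's total duration.
daily = df.groupby(["ID", "Date"], as_index=False).agg(
    day_total=pd.NamedAgg(column="Duration", aggfunc="sum")
)

# Boolean flag: did the day's total exceed 30 minutes?
daily["over_30min"] = daily["day_total"] > pd.Timedelta(minutes=30)

# Second aggregation: summing the flags counts the qualifying days per ID.
total_active = daily.groupby("ID").agg(
    num_days=pd.NamedAgg(column="over_30min", aggfunc="sum")
)
print(total_active)
```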