I have a pandas DataFrame that looks like this:
event_id timestamp
0 e0 2015-07-20 12:00:56
1 e0 2015-07-20 13:00:56
2 e1 2015-07-20 01:30:00
3 e1 2015-07-20 02:30:00
4 e1 2015-07-20 03:00:00
5 e2 2015-07-20 18:45:00
6 e2 2015-07-20 18:47:00
7 e2 2015-07-20 18:48:00
8 e2 2015-07-20 18:49:00
I want to compute the total elapsed time for each event:
timestamp count (minutes)
event_id
e0 2015-07-20 13:00:56 60.0
e1 2015-07-20 03:00:00 90.0
e2 2015-07-20 18:49:00 4.0
Answer 0 (score: 2)
Use groupby and agg:
s = df.groupby('event_id').timestamp.diff().div(pd.Timedelta(minutes=1))
df.assign(minutes=s).groupby('event_id').agg({'timestamp': 'last', 'minutes': 'sum'})
timestamp minutes
event_id
e0 2015-07-20 13:00:56 60.0
e1 2015-07-20 03:00:00 90.0
e2 2015-07-20 18:49:00 4.0
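For reference, here is a self-contained sketch of the same approach. It assumes the timestamp column has to be parsed with pd.to_datetime first (the question only shows printed output, so the dtype is a guess), and the name result is only illustrative.

```python
import pandas as pd

# Sample data from the question; parsing to datetime is an assumption
df = pd.DataFrame({
    'event_id': ['e0', 'e0', 'e1', 'e1', 'e1', 'e2', 'e2', 'e2', 'e2'],
    'timestamp': pd.to_datetime([
        '2015-07-20 12:00:56', '2015-07-20 13:00:56',
        '2015-07-20 01:30:00', '2015-07-20 02:30:00', '2015-07-20 03:00:00',
        '2015-07-20 18:45:00', '2015-07-20 18:47:00',
        '2015-07-20 18:48:00', '2015-07-20 18:49:00',
    ]),
})

# Minutes since the previous row of the same event
# (NaN for the first row of each event, which sum() skips)
s = df.groupby('event_id')['timestamp'].diff().div(pd.Timedelta(minutes=1))

# Keep the last timestamp per event and add up the per-row gaps
result = (
    df.assign(minutes=s)
      .groupby('event_id')
      .agg({'timestamp': 'last', 'minutes': 'sum'})
)
print(result)
```

Note that diff() and 'last' both depend on row order, so this relies on the rows already being sorted by timestamp within each event, as they are in the question.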
Answer 1 (score: 1)
Recreate the DataFrame:
import pandas as pd
df = pd.DataFrame([['e0','2015-07-20 12:00:56'],
['e0','2015-07-20 13:00:56'],
['e1','2015-07-20 01:30:00'],
['e1','2015-07-20 02:30:00'],
['e1','2015-07-20 03:00:00'],
['e2','2015-07-20 18:45:00'],
['e2','2015-07-20 18:47:00'],
['e2','2015-07-20 18:48:00'],
['e2','2015-07-20 18:49:00']],
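columns=['event_id','timestamp'])
# Parse the strings as datetimes so they can be subtracted later
df['timestamp'] = pd.to_datetime(df['timestamp'])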
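You can use sort_values() to make sure the timestamp column is sorted within each event_id group. Then you can use groupby(), apply(), and pd.Timedelta() to compute, for each row, the time elapsed in minutes since the first row of its group: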
# group_keys=False keeps the original row index so the result aligns with df
df['count (minutes)'] = df.sort_values(['event_id','timestamp']).groupby('event_id', group_keys=False)['timestamp'].apply(lambda x: (x-x.iloc[0])/pd.Timedelta(1, 'm'))
which gives:
event_id timestamp count (minutes)
0 e0 2015-07-20 12:00:56 0.0
1 e0 2015-07-20 13:00:56 60.0
2 e1 2015-07-20 01:30:00 0.0
3 e1 2015-07-20 02:30:00 60.0
4 e1 2015-07-20 03:00:00 90.0
5 e2 2015-07-20 18:45:00 0.0
6 e2 2015-07-20 18:47:00 2.0
7 e2 2015-07-20 18:48:00 3.0
8 e2 2015-07-20 18:49:00 4.0
Then you can call groupby() again and use last() to return the last row of each group:
df.groupby('event_id').last()
which yields:
timestamp count (minutes)
event_id
e0 2015-07-20 13:00:56 60.0
e1 2015-07-20 03:00:00 90.0
e2 2015-07-20 18:49:00 4.0
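If only the per-event summary is needed, a shorter variant is possible. This is a sketch, not part of the original answer: it assumes "total time" means the span between the earliest and latest timestamp of each event, and the names grouped and summary are illustrative. The rows are shuffled on purpose to show that no sorting is required.

```python
import pandas as pd

df = pd.DataFrame([['e0', '2015-07-20 12:00:56'],
                   ['e0', '2015-07-20 13:00:56'],
                   ['e1', '2015-07-20 01:30:00'],
                   ['e1', '2015-07-20 02:30:00'],
                   ['e1', '2015-07-20 03:00:00'],
                   ['e2', '2015-07-20 18:45:00'],
                   ['e2', '2015-07-20 18:47:00'],
                   ['e2', '2015-07-20 18:48:00'],
                   ['e2', '2015-07-20 18:49:00']],
                  columns=['event_id', 'timestamp'])
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Shuffle the rows: max() and min() do not depend on row order
df = df.sample(frac=1, random_state=0)

grouped = df.groupby('event_id')['timestamp']
summary = pd.DataFrame({
    'timestamp': grouped.max(),  # latest timestamp per event
    'count (minutes)': (grouped.max() - grouped.min()) / pd.Timedelta(minutes=1),
})
print(summary)
```

For timestamps that are sorted within each event, this span equals the sum of the consecutive differences, so it should match the table above.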
Answer 2 (score: 0)
You can try using groupby without sorting,
{