I have a DataFrame similar to this:
+------------+---------------------+---------+
| action | ts | uid |
+------------+---------------------+---------+
| action1 | 2013-01-01 00:00:00 | 543534 |
| action2 | 2013-01-01 00:00:00 | 543544 |
| action1 | 2013-01-01 00:00:02 | 543542 |
| action2 | 2013-01-01 00:00:03 | 543541 |
| .... | .... | ... |
+------------+---------------------+---------+
I want to count how many actions of each type every user performed within a given time range, so the expected output looks like this:
uid action1 action2
543534 10 1
543534 0 2
...
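I was thinking of doing this by first applying .groupby('uid'), then iterating over the grouped object, selecting the rows whose ts falls within the given range, and then concatenating the DataFrames into a result DataFrame. So, roughly like this: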
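df = ...
start_date = ...
end_date = ...
result = {}
grouped = df.groupby('uid')
grouped_dict = dict(list(grouped))
for uid, group in grouped_dict.items():
    # Combine the two masks element-wise with &; the keyword `and` raises on a Series
    in_range = group[(group.ts > start_date) & (group.ts < end_date)]
    # Count the in-range rows per action type for this user
    result[uid] = in_range['action'].value_counts()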
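I haven't run this code, but I think that even if it works it would be very inefficient. Even converting the grouped object to a dictionary takes a lot of time. What would be a more efficient approach in this case?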
Answer 0 (score: 4)
You can filter by the time range and then group by uid and action:
import pandas as pd

start_date = pd.to_datetime('2013-01-01 00:00:00')
end_date = pd.to_datetime('2013-01-01 00:00:07')
print(df)
# Filter to the time range, count rows per (uid, action), then pivot actions into columns
print(df[(df.ts > start_date) & (df.ts < end_date)].groupby(['uid', 'action'])['ts'].count().unstack('action').fillna(0))
Output:
action ts uid
0 action1 2013-01-01 00:00:00 1
1 action2 2013-01-01 00:00:00 2
2 action1 2013-01-01 00:00:02 2
3 action2 2013-01-01 00:00:03 1
4 action2 2013-01-01 00:00:04 2
5 action2 2013-01-01 00:00:05 1
6 action1 2013-01-01 00:00:06 1
action action1 action2
uid
1 1 2
2 1 1
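For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the sample rows shown in the output above and swaps in size() with unstack(fill_value=0) instead of count() followed by fillna(0); that variation is my own, chosen so the counts stay integers when a uid is missing one of the actions.

```python
import pandas as pd

# Sample data copied from the output shown above.
df = pd.DataFrame({
    "action": ["action1", "action2", "action1", "action2",
               "action2", "action2", "action1"],
    "ts": pd.to_datetime([
        "2013-01-01 00:00:00", "2013-01-01 00:00:00", "2013-01-01 00:00:02",
        "2013-01-01 00:00:03", "2013-01-01 00:00:04", "2013-01-01 00:00:05",
        "2013-01-01 00:00:06",
    ]),
    "uid": [1, 2, 2, 1, 2, 1, 1],
})

start_date = pd.to_datetime("2013-01-01 00:00:00")
end_date = pd.to_datetime("2013-01-01 00:00:07")

# Keep only rows strictly inside the range (both bounds exclusive, as above).
in_range = df[(df["ts"] > start_date) & (df["ts"] < end_date)]

# Count rows per (uid, action) pair and pivot the actions into columns.
# fill_value=0 fills missing pairs directly, so no NaN (and no float cast) appears.
counts = in_range.groupby(["uid", "action"]).size().unstack("action", fill_value=0)
print(counts)
```

The two rows stamped exactly at start_date are excluded because the comparison is strict; use >= and <= if the bounds should be inclusive.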
Answer 1 (score: 1)
Looking at the pandas.DataFrame interface, I would select the data like this:
# Select the interesting date range
bydate = df[(df['ts'] > start_date) & (df['ts'] < end_date)]
# Now this will group by uid, *then* by action (pass the keys as a list)
grouped = bydate.groupby(['uid', 'action'])
Now, let's print the number of actions for each uid:
for indices, data in grouped:
    print("Uid {}, Action '{}': {}".format(indices[0], indices[1], len(data)))
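If a table is more useful than printed lines, the same filter can feed pd.crosstab. This is only a sketch of an alternative, not part of the answer above: the helper name count_actions and the four sample rows are made up for illustration, and the column names ts, uid and action are assumed to match the question.

```python
import pandas as pd


def count_actions(df, start_date, end_date):
    """Hypothetical helper: per-uid counts of each action with start_date < ts < end_date."""
    bydate = df[(df["ts"] > start_date) & (df["ts"] < end_date)]
    # crosstab tabulates how often each (uid, action) pair occurs;
    # pairs that never occur in the filtered data show up as 0.
    return pd.crosstab(bydate["uid"], bydate["action"])


# Illustrative data only (not from the question).
df = pd.DataFrame({
    "action": ["action1", "action2", "action1", "action1"],
    "ts": pd.to_datetime([
        "2013-01-01 00:00:01", "2013-01-01 00:00:02",
        "2013-01-01 00:00:03", "2013-01-01 00:00:09",
    ]),
    "uid": [1, 1, 2, 2],
})

table = count_actions(
    df,
    pd.Timestamp("2013-01-01 00:00:00"),
    pd.Timestamp("2013-01-01 00:00:05"),
)
print(table)
```

Here uid 2 gets a 0 under action2, and its action1 row at 00:00:09 is dropped because it falls outside the range.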