My server logs basically have 3 columns:
+-------------------+-------------+-------------+
| time | ID | Action |
+-------------------+-------------+-------------+
|2018-09-14 12:52:33| Henry | Action F |
|2018-09-15 17:34:00| Henry | Action R |
|2018-09-15 20:12:33| Henry | Action T |
|2018-09-15 14:34:33| Jess | Action G |
|2018-09-17 12:21:33| Jess | Action R |
|2018-09-19 23:11:33| Jess | Action R |
|2018-09-21 09:22:01| Sarah | Action U |
|2018-09-14 12:52:33| Ali | Action P |
|2018-09-14 12:53:33| Ali | Action U |
|2018-09-14 12:54:29| Ali | Action U |
|2018-09-14 12:54:51| Ali | Action X |
|2018-09-14 14:09:12| Ali | Action O |
|2018-09-14 14:13:32| Ali | Action T |
|2018-09-18 22:52:21| Ali | Action E |
|2018-09-20 12:52:44| John | Action W |
|2018-09-20 12:54:13| John | Action Z |
|2018-09-17 09:26:13| Mike | Action W |
|2018-09-17 10:39:33| Mike | Action Q |
|2018-09-18 12:15:33| Mike | Action L |
|2018-09-18 12:15:36| Mike | Action L |
+-------------------+-------------+-------------+
only showing top 20 rows
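Now, for each ID, I want to split the timestamps into sessions, where a session is a run of consecutive activity in which the gap between two successive actions is less than 3 hours. (A pause of 3 hours or more starts a new session.)
Then I want to group by [ID, Session] to get something like: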
+-------------+------------------------------------+-----------------------+-----------------------+
| ID | Action_list | start | end |
+-------------+------------------------------------+-----------------------+-----------------------+
| Henry | [Action F] | 2018-09-14 12:52:33 | 2018-09-14 12:52:33 |
| Henry | [Action R, Action T] | 2018-09-15 17:34:00 | 2018-09-15 20:12:33 |
| Jess | [Action G] | 2018-09-15 14:34:33 | 2018-09-15 14:34:33 |
| Jess | [Action R] | 2018-09-17 12:21:33 | 2018-09-17 12:21:33 |
| Jess | [Action R] | 2018-09-19 23:11:33 | 2018-09-19 23:11:33 |
| Sarah | [Action U] | 2018-09-21 09:22:01 | 2018-09-21 09:22:01 |
| Ali | [Action P, Action U, Action U...] | 2018-09-14 12:52:33 | 2018-09-14 14:13:32 |
| Ali | [Action E] | 2018-09-18 22:52:21 | 2018-09-18 22:52:21 |
| John | [Action W, Action Z] | 2018-09-20 12:52:44 | 2018-09-20 12:54:13 |
| Mike | [Action W, Action Q] | 2018-09-17 09:26:13 | 2018-09-17 10:39:33 |
| Mike | [Action L, Action L] | 2018-09-18 12:15:33 | 2018-09-18 12:15:36 |
+-------------+------------------------------------+-----------------------+-----------------------+
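To make the session rule concrete, here is a minimal plain-Python sketch (not Spark) of the grouping I am after. The names `sessionize`, `GAP`, and `sample` are made up for this illustration, and it only uses a few of the rows above. It assumes events are ordered by time within each ID; an actual solution would have to do the equivalent on a Spark DataFrame.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Assumption from the description: a pause of 3 hours or more starts a new session.
GAP = timedelta(hours=3)
FMT = "%Y-%m-%d %H:%M:%S"


def sessionize(rows, gap=GAP):
    """Group (time_string, id, action) rows into sessions per id.

    Returns a list of (id, action_list, start, end) tuples,
    ordered by id and then by session start.
    """
    # Sort by id first, then by timestamp, so groupby sees each id as one run.
    parsed = sorted(
        (uid, datetime.strptime(ts, FMT), action) for ts, uid, action in rows
    )
    sessions = []
    for uid, events in groupby(parsed, key=lambda r: r[0]):
        current = None
        for _, ts, action in events:
            # Open a new session on the first event or after a long enough pause.
            if current is None or ts - current["end"] >= gap:
                current = {"id": uid, "actions": [], "start": ts, "end": ts}
                sessions.append(current)
            current["actions"].append(action)
            current["end"] = ts
    return [
        (s["id"], s["actions"], str(s["start"]), str(s["end"])) for s in sessions
    ]


# A small subset of the log rows shown above.
sample = [
    ("2018-09-14 12:52:33", "Henry", "Action F"),
    ("2018-09-15 17:34:00", "Henry", "Action R"),
    ("2018-09-15 20:12:33", "Henry", "Action T"),
    ("2018-09-17 09:26:13", "Mike", "Action W"),
    ("2018-09-17 10:39:33", "Mike", "Action Q"),
    ("2018-09-18 12:15:33", "Mike", "Action L"),
    ("2018-09-18 12:15:36", "Mike", "Action L"),
]

for row in sessionize(sample):
    print(row)
```

For these rows the sketch yields two sessions for Henry and two for Mike, matching the corresponding rows in the table above.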
Here is the code to reproduce the DataFrame.
l=[('2018-09-14 12:52:33', 'Henry' , 'Action F'),
('2018-09-15 17:34:00', 'Henry' , 'Action R'),
('2018-09-15 20:12:33', 'Henry' , 'Action T'),
('2018-09-15 14:34:33', 'Jess' , 'Action G'),
('2018-09-17 12:21:33', 'Jess' , 'Action R'),
('2018-09-19 23:11:33', 'Jess' , 'Action R'),
('2018-09-21 09:22:01', 'Sarah' , 'Action U'),
('2018-09-14 12:52:33', 'Ali' , 'Action P'),
('2018-09-14 12:53:33', 'Ali' , 'Action U'),
('2018-09-14 12:54:29', 'Ali' , 'Action U'),
('2018-09-14 12:54:51', 'Ali' , 'Action X'),
('2018-09-14 14:09:12', 'Ali' , 'Action O'),
('2018-09-14 14:13:32', 'Ali' , 'Action T'),
('2018-09-18 22:52:21', 'Ali' , 'Action E'),
('2018-09-20 12:52:44', 'John' , 'Action W'),
('2018-09-20 12:54:13', 'John' , 'Action Z'),
('2018-09-17 09:26:13', 'Mike' , 'Action W'),
('2018-09-17 10:39:33', 'Mike' , 'Action Q'),
('2018-09-18 12:15:33', 'Mike' , 'Action L'),
('2018-09-18 12:15:36', 'Mike' , 'Action L')]
# Note: the 'name' column is the ID shown in the tables above, and 'time' is loaded as a string.
df = sqlContext.createDataFrame(l, ['time','name', 'action'])
I would really appreciate your help.
P.S. The IDs are hashed. The names are used here only for simplicity, so there is no privacy violation here ;)