How to define sessions from a datetime column in PySpark

Time: 2018-11-16 17:38:04

Tags: python datetime dataframe group-by pyspark

My server logs have basically 3 columns:

  • Timestamp
  • ID
  • Action

+-------------------+-------------+-------------+
|     time          |    ID       |    Action   |
+-------------------+-------------+-------------+
|2018-09-14 12:52:33|   Henry     |   Action F  |
|2018-09-15 17:34:00|   Henry     |   Action R  |
|2018-09-15 20:12:33|   Henry     |   Action T  |
|2018-09-15 14:34:33|   Jess      |   Action G  |
|2018-09-17 12:21:33|   Jess      |   Action R  |
|2018-09-19 23:11:33|   Jess      |   Action R  |
|2018-09-21 09:22:01|   Sarah     |   Action U  |
|2018-09-14 12:52:33|   Ali       |   Action P  |
|2018-09-14 12:53:33|   Ali       |   Action U  |
|2018-09-14 12:54:29|   Ali       |   Action U  |
|2018-09-14 12:54:51|   Ali       |   Action X  |
|2018-09-14 14:09:12|   Ali       |   Action O  |
|2018-09-14 14:13:32|   Ali       |   Action T  |
|2018-09-18 22:52:21|   Ali       |   Action E  |
|2018-09-20 12:52:44|   John      |   Action W  |
|2018-09-20 12:54:13|   John      |   Action Z  |
|2018-09-17 09:26:13|   Mike      |   Action W  |
|2018-09-17 10:39:33|   Mike      |   Action Q  |
|2018-09-18 12:15:33|   Mike      |   Action L  |
|2018-09-18 12:15:36|   Mike      |   Action L  |
+-------------------+-------------+-------------+
only showing top 20 rows

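Now, for each ID, I would like to split the timestamps into sessions, where a session is a run of continuous activity in which consecutive actions are less than 3 hours apart. (A pause of 3 hours or more starts a new session.)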
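
To make the rule concrete, here is a minimal pure-Python sketch of it for the timestamps of a single ID. The names `SESSION_GAP` and `assign_sessions` are made up for illustration, and the input is assumed to be sorted in ascending order already.

```python
from datetime import datetime, timedelta

# A pause of 3 hours or more starts a new session.
SESSION_GAP = timedelta(hours=3)


def assign_sessions(timestamps, gap=SESSION_GAP):
    """Return a 1-based session number for each timestamp of ONE ID.

    `timestamps` must already be sorted in ascending order.
    """
    session_ids = []
    current = 0
    previous = None
    for ts in timestamps:
        # The first event, or any event at least `gap` after the previous
        # one, opens a new session.
        if previous is None or ts - previous >= gap:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


henry = [
    datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    for s in (
        "2018-09-14 12:52:33",
        "2018-09-15 17:34:00",
        "2018-09-15 20:12:33",
    )
]
print(assign_sessions(henry))  # [1, 2, 2]
```

Henry's last two actions are about 2 h 38 min apart, so they share a session, while the first action is more than a day earlier and stands alone.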

Then I would like to group by [ID, Session] to get:

  • Session start
  • Session end
  • List of the actions in that session

Like this:

+-------------+------------------------------------+-----------------------+-----------------------+
|    ID       |              Action_list           |           start       |          end          |
+-------------+------------------------------------+-----------------------+-----------------------+
|   Henry     |               [Action F]           |  2018-09-14 12:52:33  |  2018-09-14 12:52:33  |
|   Henry     |           [Action R, Action T]     |  2018-09-15 17:34:00  |  2018-09-15 20:12:33  |
|   Jess      |               [Action G]           |  2018-09-15 14:34:33  |  2018-09-15 14:34:33  |
|   Jess      |               [Action R]           |  2018-09-17 12:21:33  |  2018-09-17 12:21:33  |
|   Jess      |               [Action R]           |  2018-09-19 23:11:33  |  2018-09-19 23:11:33  |
|   Sarah     |               [Action U]           |  2018-09-21 09:22:01  |  2018-09-21 09:22:01  |
|   Ali       |  [Action P, Action U, Action U...] |  2018-09-14 12:52:33  |  2018-09-14 14:13:32  |
|   Ali       |               [Action E]           |  2018-09-18 22:52:21  |  2018-09-18 22:52:21  |
|   John      |            [Action W, Action Z]    |  2018-09-20 12:52:44  |  2018-09-20 12:54:13  |
|   Mike      |            [Action W, Action Q]    |  2018-09-17 09:26:13  |  2018-09-17 10:39:33  |
|   Mike      |            [Action L, Action L]    |  2018-09-18 12:15:33  |  2018-09-18 12:15:36  |
+-------------+------------------------------------+-----------------------+-----------------------+

Here is the code to reproduce the DataFrame.

l=[('2018-09-14 12:52:33',   'Henry'     ,   'Action F'),
   ('2018-09-15 17:34:00',   'Henry'     ,   'Action R'),
   ('2018-09-15 20:12:33',   'Henry'     ,   'Action T'), 
   ('2018-09-15 14:34:33',   'Jess'      ,   'Action G'),
   ('2018-09-17 12:21:33',   'Jess'      ,   'Action R'),
   ('2018-09-19 23:11:33',   'Jess'      ,   'Action R'),
   ('2018-09-21 09:22:01',   'Sarah'     ,   'Action U'),
   ('2018-09-14 12:52:33',   'Ali'       ,   'Action P'),
   ('2018-09-14 12:53:33',   'Ali'       ,   'Action U'),
   ('2018-09-14 12:54:29',   'Ali'       ,   'Action U'),
   ('2018-09-14 12:54:51',   'Ali'       ,   'Action X'),
   ('2018-09-14 14:09:12',   'Ali'       ,   'Action O'),
   ('2018-09-14 14:13:32',   'Ali'       ,   'Action T'),
   ('2018-09-18 22:52:21',   'Ali'       ,   'Action E'),
   ('2018-09-20 12:52:44',   'John'      ,   'Action W'),
   ('2018-09-20 12:54:13',   'John'      ,   'Action Z'),
   ('2018-09-17 09:26:13',   'Mike'      ,   'Action W'),
   ('2018-09-17 10:39:33',   'Mike'      ,   'Action Q'),
   ('2018-09-18 12:15:33',   'Mike'      ,   'Action L'),
   ('2018-09-18 12:15:36',   'Mike'      ,   'Action L')]
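# Assumes `sqlContext` already exists (e.g. in the pyspark shell).
# Column names match the table shown above.
df = sqlContext.createDataFrame(l, ['time', 'ID', 'Action'])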

I would really appreciate your help.

P.S. The IDs are hashed. The names are used here just for simplicity, so there is no privacy violation here ;)

0 answers:

No answers yet