Sorting rows with a timestamp column into sessions

Time: 2016-02-26 02:11:19

Tags: postgresql

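I am working with a dataset that records user activity in an online course. In total we have about 50,000 users and 20 million events. We are running Postgres 9.5.

The events table has created_at (timestamp) and user_id columns. I would like to add a time column to this table that stores, in seconds, the estimated amount of time between consecutive events for each user. I would also like to split the events into user sessions, where sessions are separated by periods of more than 30 minutes without activity. Ideally the sessions would be numbered starting from 1 for each user, but I could live with a global sequence.

With the following window query I have solved the first part of the problem: estimating the time between events. When the gap is longer than 30 minutes, which marks the end of a session, I set seconds to NULL.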

SELECT user_id, id, date_part AS diff,
    CASE WHEN date_part > 1800 THEN NULL ELSE date_part END AS seconds 
FROM 
    (SELECT user_id, id, EXTRACT(EPOCH FROM (lead - time)) 
    FROM 
    (SELECT user_id, id, created_at AS time, lead(created_at) 
        OVER(PARTITION BY user_id ORDER BY created_at) 
        FROM events) AS A) AS B;

This gives me the following result:

         user_id |   id    |      diff       |   seconds   
-----------------+---------+-----------------+-------------
               1 |    1934 |        4.499914 |    4.499914
               1 |    1935 |        3.275266 |    3.275266
               1 |    1936 |   125676.773213 |            
               1 |    2994 |        3.064404 |    3.064404
               1 |    2995 |        4.692644 |    4.692644
               1 |    3134 |        9.889537 |    9.889537
               1 |    2996 |       32.071339 |   32.071339
               1 |    2924 |       28.536395 |   28.536395
               1 |    2997 |        1.508108 |            
               2 |    3236 |       18.364849 |   18.364849
               2 |    3243 |       12.052791 |   12.052791
               2 |    3245 |    12936.064333 |            
               2 |    3621 |        8.559128 |    8.559128
               2 |    3672 |      381.158063 |  381.158063
               2 |    3673 | 10797574.575174 |            
               2 | 1264501 |        3.242143 |    3.242143
               2 | 1264546 |     1135.754492 | 1135.754492
               2 | 1264577 |      256.417076 |  256.417076
               2 | 1264244 | 18137835.531789 |            
               2 | 2736714 |       43.244278 |   43.244278
               2 | 2736781 |    36204.912999 |            
               2 | 2747358 |        2.962074 |    2.962074
               2 | 2747359 |     39448.37133 |            

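How can I modify this query to add a session column, so that the first three events in the example (up to and including the row with the NULL seconds value) are session 1, the following events are session 2, and so on? Right now my best solution is to loop over users and events in Ruby, which takes a very long time.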

         user_id |   id    |      diff       |   seconds   |  session
-----------------+---------+-----------------+-------------+----------
               1 |    1934 |        4.499914 |    4.499914 |        1
               1 |    1935 |        3.275266 |    3.275266 |        1
               1 |    1936 |   125676.773213 |             |        1
               1 |    2994 |        3.064404 |    3.064404 |        2
               1 |    2995 |        4.692644 |    4.692644 |        2
               1 |    3134 |        9.889537 |    9.889537 |        2
               1 |    2996 |       32.071339 |   32.071339 |        2
               1 |    2924 |       28.536395 |   28.536395 |        2
               1 |    2997 |        1.508108 |             |        2
               2 |    3236 |       18.364849 |   18.364849 |        3
               2 |    3243 |       12.052791 |   12.052791 |        3
               2 |    3245 |    12936.064333 |             |        3
               2 |    3621 |        8.559128 |    8.559128 |        4
               2 |    3672 |      381.158063 |  381.158063 |        4
               2 |    3673 | 10797574.575174 |             |        4
               2 | 1264501 |        3.242143 |    3.242143 |        5
               2 | 1264546 |     1135.754492 | 1135.754492 |        5
               2 | 1264577 |      256.417076 |  256.417076 |        5
               2 | 1264244 | 18137835.531789 |             |        5
               2 | 2736714 |       43.244278 |   43.244278 |        6
               2 | 2736781 |    36204.912999 |             |        6
               2 | 2747358 |        2.962074 |    2.962074 |        7
               2 | 2747359 |     39448.37133 |             |        7

Thanks!

1 answer:

Answer 0 (score: 0)

If you have a better field to order by, just put it in the ROW_NUMBER() ORDER BY, but I think you can use NULL.

WITH cte as (
    SELECT user_id, 
           id, 
           date_part AS diff,
           CASE  
                WHEN date_part > 1800 THEN NULL 
                ELSE date_part 
           END AS seconds,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) rn
    FROM 
        (SELECT user_id, id, EXTRACT(EPOCH FROM (lead - time)) 
         FROM 
             (SELECT user_id, 
                     id, 
                     created_at AS time, 
                     lead(created_at) OVER(PARTITION BY user_id ORDER BY created_at) 
              FROM events) AS A
            ) AS B
)
SELECT *, CASE WHEN rn <= 3 THEN 1 
               ELSE 2
          END as session              
FROM cte
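
The CASE on rn above only fits the first rows of the sample. One common way to number sessions for every user is to flag each event that starts a session (the first event of a user, or one that follows a gap of more than 1800 seconds) and then take a running sum of those flags. The sketch below is only an illustration of that idea, not a tested Postgres solution: it runs the query against an in-memory SQLite database through Python's sqlite3 module so it can be executed standalone, the sample rows are made up, created_at is stored as integer epoch seconds, and the names since_prev, is_start, user_session and global_session are hypothetical. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Made-up sample rows: (id, user_id, created_at as Unix epoch seconds).
# Assumption: integer timestamps keep the sketch portable across engines.
EVENTS = [
    (1, 1, 0),
    (2, 1, 5),
    (3, 1, 8),
    (4, 1, 5000),   # more than 1800 s after the previous event: new session
    (5, 1, 5010),
    (6, 2, 100),
    (7, 2, 130),
    (8, 2, 9000),   # new session for user 2
    (9, 2, 9001),
]

QUERY = """
WITH gaps AS (
    -- Seconds since the same user's previous event (NULL for the first one).
    SELECT id, user_id, created_at,
           created_at - lag(created_at) OVER (
               PARTITION BY user_id ORDER BY created_at, id
           ) AS since_prev
    FROM events
),
flagged AS (
    -- 1 when the event opens a session, 0 otherwise.
    SELECT id, user_id, created_at,
           CASE WHEN since_prev IS NULL OR since_prev > 1800
                THEN 1 ELSE 0 END AS is_start
    FROM gaps
)
SELECT user_id, id,
       -- Running sum per user: sessions numbered from 1 for each user.
       SUM(is_start) OVER (
           PARTITION BY user_id ORDER BY created_at, id
       ) AS user_session,
       -- Running sum over all rows: one global session sequence.
       SUM(is_start) OVER (
           ORDER BY user_id, created_at, id
       ) AS global_session
FROM flagged
ORDER BY user_id, created_at, id
"""


def sessionize(events):
    """Load the events into an in-memory table and return the numbered rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, created_at INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in sessionize(EVENTS):
        print(row)
```

The query uses only lag(), CASE and SUM() OVER, which Postgres 9.5 also provides. With a real timestamp column in Postgres, subtracting two timestamps yields an interval, so the comparison would be against interval '30 minutes' (or against 1800 after EXTRACT(EPOCH FROM ...)) rather than a plain integer. Adding id to each ORDER BY is an assumption that breaks ties between events with identical timestamps.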