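I'm working with a dataset that records user activity in an online course. We have about 50,000 users and 20 million events in total, and we are running Postgres 9.5.
The events table contains created_at (timestamp) and user_id columns. I'd like to add a time column to this table that stores, in seconds, the estimated amount of time between each user's consecutive events. I'd also like to split the events into user sessions, separated by periods of more than 30 minutes without activity. Ideally the sessions would be numbered starting from 1 for each user, but I can work with a global sequence.
With the following window query I have solved the first part of the problem, estimating the time between events. When the gap is longer than 30 minutes, which marks the end of a session, I set seconds to NULL.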
SELECT user_id, id, date_part AS diff,
       CASE WHEN date_part > 1800 THEN NULL ELSE date_part END AS seconds
FROM
    (SELECT user_id, id, EXTRACT(EPOCH FROM (lead - time))
     FROM
        (SELECT user_id, id, created_at AS time, lead(created_at)
                OVER (PARTITION BY user_id ORDER BY created_at)
         FROM events) AS A) AS B;
This leaves me with the following result:
user_id | id | diff | seconds
-----------------+---------+-----------------+-------------
1 | 1934 | 4.499914 | 4.499914
1 | 1935 | 3.275266 | 3.275266
1 | 1936 | 125676.773213 |
1 | 2994 | 3.064404 | 3.064404
1 | 2995 | 4.692644 | 4.692644
1 | 3134 | 9.889537 | 9.889537
1 | 2996 | 32.071339 | 32.071339
1 | 2924 | 28.536395 | 28.536395
1 | 2997 | 1.508108 |
2 | 3236 | 18.364849 | 18.364849
2 | 3243 | 12.052791 | 12.052791
2 | 3245 | 12936.064333 |
2 | 3621 | 8.559128 | 8.559128
2 | 3672 | 381.158063 | 381.158063
2 | 3673 | 10797574.575174 |
2 | 1264501 | 3.242143 | 3.242143
2 | 1264546 | 1135.754492 | 1135.754492
2 | 1264577 | 256.417076 | 256.417076
2 | 1264244 | 18137835.531789 |
2 | 2736714 | 43.244278 | 43.244278
2 | 2736781 | 36204.912999 |
2 | 2747358 | 2.962074 | 2.962074
2 | 2747359 | 39448.37133 |
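How can I modify this query to add a session column, where the first three events in the example (up to and including the NULL seconds value) are session 1, the events after them are session 2, and so on, as in the desired output below? Right now my best solution is to loop over users and events in Ruby, which takes a very long time.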
user_id | id | diff | seconds | session
-----------------+---------+-----------------+-------------+---------
1 | 1934 | 4.499914 | 4.499914 | 1
1 | 1935 | 3.275266 | 3.275266 | 1
1 | 1936 | 125676.773213 | | 1
1 | 2994 | 3.064404 | 3.064404 | 2
1 | 2995 | 4.692644 | 4.692644 | 2
1 | 3134 | 9.889537 | 9.889537 | 2
1 | 2996 | 32.071339 | 32.071339 | 2
1 | 2924 | 28.536395 | 28.536395 | 2
1 | 2997 | 1.508108 | | 2
2 | 3236 | 18.364849 | 18.364849 | 3
2 | 3243 | 12.052791 | 12.052791 | 3
2 | 3245 | 12936.064333 | | 3
2 | 3621 | 8.559128 | 8.559128 | 4
2 | 3672 | 381.158063 | 381.158063 | 4
2 | 3673 | 10797574.575174 | | 4
2 | 1264501 | 3.242143 | 3.242143 | 5
2 | 1264546 | 1135.754492 | 1135.754492 | 5
2 | 1264577 | 256.417076 | 256.417076 | 5
2 | 1264244 | 18137835.531789 | | 5
2 | 2736714 | 43.244278 | 43.244278 | 6
2 | 2736781 | 36204.912999 | | 6
2 | 2747358 | 2.962074 | 2.962074 | 7
2 | 2747359 | 39448.37133 | | 7
Thanks!
Answer 0 (score: 0)
Just add ROW_NUMBER(). Check the ORDER BY: if you have a better field to order by, use that, but I think you can use NULL.
WITH cte AS (
    SELECT user_id,
           id,
           date_part AS diff,
           CASE
               WHEN date_part > 1800 THEN NULL
               ELSE date_part
           END AS seconds,
           -- ORDER BY NULL numbers the rows in arbitrary order;
           -- substitute a real column if one gives the order you need
           ROW_NUMBER() OVER (ORDER BY NULL) AS rn
    FROM
        (SELECT user_id, id, EXTRACT(EPOCH FROM (lead - time))
         FROM
            (SELECT user_id,
                    id,
                    created_at AS time,
                    lead(created_at) OVER (PARTITION BY user_id ORDER BY created_at)
             FROM events) AS A
        ) AS B
)
SELECT *,
       CASE WHEN rn <= 3 THEN 1
            ELSE 2
       END AS session
FROM cte;
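
The query above hard-codes the first three rows as session 1, so it only fits the sample data. A more general way to get the session numbering the question asks for is the "gaps and islands" pattern: flag every event that opens a session, then take a running sum of the flags. The following is a minimal sketch of that pattern, not a tested solution. It assumes the events table has the id, user_id and created_at columns shown in the question, and it adds id as a tie-breaker in the ordering, which the original query does not do. The names new_session, flagged and w are made up for illustration.

```sql
-- Sketch: number sessions per user with a running count of session starts.
-- Assumes events(id, user_id, created_at) as described in the question.
SELECT user_id,
       id,
       diff,
       CASE WHEN diff > 1800 THEN NULL ELSE diff END AS seconds,
       -- Running total of session-start flags = session number,
       -- restarting at 1 for each user.
       SUM(new_session) OVER (PARTITION BY user_id
                              ORDER BY created_at, id) AS session
FROM (
    SELECT user_id,
           id,
           created_at,
           -- Seconds until this user's next event (NULL for the last event).
           EXTRACT(EPOCH FROM (LEAD(created_at) OVER w - created_at)) AS diff,
           -- 1 when this event opens a session: it is the user's first event,
           -- or it comes more than 30 minutes after the previous one.
           CASE
               WHEN LAG(created_at) OVER w IS NULL
                 OR created_at - LAG(created_at) OVER w > INTERVAL '30 minutes'
               THEN 1
               ELSE 0
           END AS new_session
    FROM events
    WINDOW w AS (PARTITION BY user_id ORDER BY created_at, id)
) AS flagged
ORDER BY user_id, created_at, id;
```

If a global sequence like the one in the desired output is preferred, dropping PARTITION BY user_id from the outer SUM and ordering that window by user_id, created_at, id should give session numbers that keep increasing across users. With 20 million rows, an index on (user_id, created_at) would likely help the sort, though that is worth checking with EXPLAIN on the real data.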