I have a table with 3 fields: user_id, page and timestamp, as shown below:
user_id page timestamp
1234567 home.all 2018-03-01 00:10
7541231 task.now 2018-03-01 03:51
7541231 home.all 2018-03-01 03:53
4544731 talk.wow 2018-03-01 04:56
4544731 task.now 2018-03-01 05:01
4544731 home.all 2018-03-01 05:02
4544731 bla.home 2018-03-01 05:26
4544731 home.all 2018-03-01 06:40
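The timestamp is the time at which the user with the given ID loaded the given page on the website. Each observation is a page view.
I need to assign a session ID to each observation. Each session ID should be unique per session, where a session is a set of page views by the same user_id in which each page view is no more than 3600 seconds away from its nearest neighbor in time.
The result should look like this: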
user_id page timestamp session_id
1234567 home.all 2018-03-01 00:10 1234567-2018030100100010
7541231 task.now 2018-03-01 03:51 7541231-2018030103510353
7541231 home.all 2018-03-01 03:53 7541231-2018030103510353
4544731 talk.wow 2018-03-01 04:56 4544731-2018030104560526
4544731 task.now 2018-03-01 05:01 4544731-2018030104560526
4544731 home.all 2018-03-01 05:02 4544731-2018030104560526
4544731 bla.home 2018-03-01 05:26 4544731-2018030104560526
4544731 home.all 2018-03-01 06:40 4544731-2018030106400640
Could you suggest a query for this?
Answer 0 (score: 0)
If you can guarantee that the (user_id, timestamp) pairs are unique, the following might work for you.
WITH cte AS
(
  -- For each hit, join the same user's immediately preceding hit (h2) and
  -- immediately following hit (h4), and flag whether each lies within 3600 seconds.
  SELECT h1.user_id,
         h1.page,
         h1.timestamp,
         coalesce(h1.timestamp - h2.timestamp <= INTERVAL '3600 SECONDS', false) AS shares_session_with_previous,
         coalesce(h4.timestamp - h1.timestamp <= INTERVAL '3600 SECONDS', false) AS shares_session_with_next
  FROM hit h1
  LEFT JOIN hit h2
         ON h2.user_id = h1.user_id
        AND h2.timestamp = (SELECT max(h3.timestamp)
                            FROM hit h3
                            WHERE h3.user_id = h1.user_id
                              AND h3.timestamp < h1.timestamp)
  LEFT JOIN hit h4
         ON h4.user_id = h1.user_id
        AND h4.timestamp = (SELECT min(h5.timestamp)
                            FROM hit h5
                            WHERE h5.user_id = h1.user_id
                              AND h5.timestamp > h1.timestamp)
)
SELECT c1.user_id,
       c1.page,
       c1.timestamp,
       -- Session start: the latest hit at or before this one that does not continue a previous hit.
       concat((SELECT concat(c2.user_id, '-', to_char(max(c2.timestamp), 'YYYYMMDDHH24MI'))
               FROM cte c2
               WHERE c2.user_id = c1.user_id
                 AND c2.timestamp <= c1.timestamp
                 AND NOT c2.shares_session_with_previous
               GROUP BY c2.user_id),
              -- Session end: the earliest hit at or after this one that is not continued by a next hit.
              (SELECT to_char(min(c2.timestamp), 'HH24MI')
               FROM cte c2
               WHERE c2.user_id = c1.user_id
                 AND c2.timestamp >= c1.timestamp
                 AND NOT c2.shares_session_with_next)) AS session_id
FROM cte c1
ORDER BY c1.timestamp;
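To try the query against the sample rows from the question, something like the following setup could be used. This is only a sketch: the table name hit is taken from the query, but the column types and the unique constraint are my assumptions, since the question does not state the schema.

```sql
-- Assumed schema: only the column names come from the question,
-- the types and the constraint are guesses.
CREATE TABLE hit (
    user_id     integer   NOT NULL,
    page        text      NOT NULL,
    "timestamp" timestamp NOT NULL,
    UNIQUE (user_id, "timestamp")  -- the uniqueness the query relies on
);

-- Sample rows copied from the question.
INSERT INTO hit (user_id, page, "timestamp") VALUES
    (1234567, 'home.all', '2018-03-01 00:10'),
    (7541231, 'task.now', '2018-03-01 03:51'),
    (7541231, 'home.all', '2018-03-01 03:53'),
    (4544731, 'talk.wow', '2018-03-01 04:56'),
    (4544731, 'task.now', '2018-03-01 05:01'),
    (4544731, 'home.all', '2018-03-01 05:02'),
    (4544731, 'bla.home', '2018-03-01 05:26'),
    (4544731, 'home.all', '2018-03-01 06:40');
```

The unique constraint matters because the joins match neighboring rows by timestamp equality, so two rows with the same user_id and timestamp would be duplicated in the result.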
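The core part is the CTE. For each row, it joins the row with the immediately preceding timestamp and the row with the immediately following timestamp for the same user. It then checks whether the interval between that preceding or following timestamp and the row's own timestamp is less than or equal to 3600 seconds. The results of these checks are stored in the flags shares_session_with_previous and shares_session_with_next.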
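As a side note, the same two flags could probably also be computed with the window functions lag() and lead() instead of the self-joins. This is a sketch of that idea, not what the query above does, and it assumes the same hit table with user_id, page and timestamp columns:

```sql
-- Sketch: compute the two flags with window functions instead of self-joins.
-- lag()/lead() return NULL for a user's first/last hit, so coalesce() turns
-- the resulting NULL comparison into false, just like the LEFT JOIN version.
SELECT user_id,
       page,
       "timestamp",
       coalesce(("timestamp" - lag("timestamp") OVER w) <= INTERVAL '3600 seconds', false)
           AS shares_session_with_previous,
       coalesce((lead("timestamp") OVER w - "timestamp") <= INTERVAL '3600 seconds', false)
           AS shares_session_with_next
FROM hit
WINDOW w AS (PARTITION BY user_id ORDER BY "timestamp")
ORDER BY user_id, "timestamp";
```

For user 4544731 in the sample data, the 05:26 row would get shares_session_with_next = false, because the next hit at 06:40 is 74 minutes later, and the 06:40 row would get false for both flags.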
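The flags are then used to find the start and the end of each session. The start is the greatest timestamp less than or equal to the current row's timestamp for which shares_session_with_previous is false. The end is the smallest timestamp greater than or equal to the current row's timestamp for which shares_session_with_next is false.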
The corresponding values for the session start and end are concatenated to form the session ID.
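If the correlated subqueries turn out to be slow on a large table, a common alternative is the "gaps and islands" pattern: mark every hit that starts a new session, number the sessions with a running sum, and take the minimum and maximum timestamp per session. The following is a minimal sketch of that pattern, assuming the same hit table; the names flagged, numbered, starts_session and session_no are made up for illustration.

```sql
WITH flagged AS (
    -- 1 if this hit starts a new session, 0 if it continues the previous one.
    -- For a user's first hit lag() is NULL, the comparison is NULL,
    -- and the CASE falls through to ELSE 1.
    SELECT user_id,
           page,
           "timestamp",
           CASE
               WHEN "timestamp" - lag("timestamp") OVER (PARTITION BY user_id ORDER BY "timestamp")
                    <= INTERVAL '3600 seconds'
               THEN 0
               ELSE 1
           END AS starts_session
    FROM hit
),
numbered AS (
    -- The running sum of the start markers gives each session a number per user.
    SELECT *,
           sum(starts_session) OVER (PARTITION BY user_id ORDER BY "timestamp") AS session_no
    FROM flagged
)
SELECT user_id,
       page,
       "timestamp",
       -- Same format as in the question: user_id, session start, then HHMI of session end.
       concat(user_id, '-',
              to_char(min("timestamp") OVER s, 'YYYYMMDDHH24MI'),
              to_char(max("timestamp") OVER s, 'HH24MI')) AS session_id
FROM numbered
WINDOW s AS (PARTITION BY user_id, session_no)
ORDER BY "timestamp";
```

On the sample data this should give 4544731-2018030104560526 for the four hits between 04:56 and 05:26 and 4544731-2018030106400640 for the hit at 06:40. Note that, as in the requested format, only the hour and minute of the session end are included, so a session that runs past midnight would not show the end date.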