Grouping page views into sessions using timestamp differences

Time: 2018-06-05 19:11:45

Tags: sql postgresql timestamp sessionid pageviews

I have a table with 3 fields, user_id, page and timestamp, as shown below:

user_id     page        timestamp
1234567     home.all    2018-03-01 00:10
7541231     task.now    2018-03-01 03:51
7541231     home.all    2018-03-01 03:53
4544731     talk.wow    2018-03-01 04:56
4544731     task.now    2018-03-01 05:01
4544731     home.all    2018-03-01 05:02
4544731     bla.home    2018-03-01 05:26
4544731     home.all    2018-03-01 06:40

The timestamp is the time at which the user with the given ID loaded the given page on the website. Each observation is one page view.
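To reproduce the example, the sample rows could be loaded roughly as follows. This is only a minimal sketch: the table name hit is borrowed from the query in the answer, and the column types are assumptions, since the question does not state them.

```sql
-- Hypothetical setup: the table name "hit" comes from the answer's query,
-- and the column types are assumed (they are not given in the question).
CREATE TABLE hit (
    user_id     integer   NOT NULL,
    page        text      NOT NULL,
    "timestamp" timestamp NOT NULL  -- quoted as a precaution, since it is also a type name
);

-- The eight page views from the sample table above.
INSERT INTO hit (user_id, page, "timestamp") VALUES
    (1234567, 'home.all', '2018-03-01 00:10'),
    (7541231, 'task.now', '2018-03-01 03:51'),
    (7541231, 'home.all', '2018-03-01 03:53'),
    (4544731, 'talk.wow', '2018-03-01 04:56'),
    (4544731, 'task.now', '2018-03-01 05:01'),
    (4544731, 'home.all', '2018-03-01 05:02'),
    (4544731, 'bla.home', '2018-03-01 05:26'),
    (4544731, 'home.all', '2018-03-01 06:40');
```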

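I need to assign a session ID to each observation. Each session ID should be unique to its session, where a session is a set of page views by the same user_id in which the time difference between adjacent timestamps does not exceed 3600 seconds.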

The result should look like this:

user_id     page        timestamp           session_id
1234567     home.all    2018-03-01 00:10    1234567-2018030100100010
7541231     task.now    2018-03-01 03:51    7541231-2018030103510353
7541231     home.all    2018-03-01 03:53    7541231-2018030103510353
4544731     talk.wow    2018-03-01 04:56    4544731-2018030104560526
4544731     task.now    2018-03-01 05:01    4544731-2018030104560526
4544731     home.all    2018-03-01 05:02    4544731-2018030104560526
4544731     bla.home    2018-03-01 05:26    4544731-2018030104560526
4544731     home.all    2018-03-01 06:40    4544731-2018030106400640

Could you please suggest a query?

1 Answer:

Answer 0: (score: 0)

If you can ensure that each (user_id, timestamp) pair is unique, the following may work for you.

WITH cte AS
(
-- For each hit h1, join the same user's previous hit (h2) and next hit (h4)
-- and flag whether each one lies within 3600 seconds of h1.
SELECT h1.user_id,
       h1.page,
       h1.timestamp,
       -- coalesce turns the NULL produced by a missing neighbor into false
       coalesce(h1.timestamp - h2.timestamp <= INTERVAL '3600 SECONDS', false) AS shares_session_with_previous,
       coalesce(h4.timestamp - h1.timestamp <= INTERVAL '3600 SECONDS', false) AS shares_session_with_next
       FROM hit h1
            -- h2: the latest hit of the same user strictly before h1
            LEFT JOIN hit h2
                      ON h2.user_id = h1.user_id
                         AND h2.timestamp = (SELECT max(h3.timestamp)
                                                    FROM hit h3
                                                    WHERE h3.user_id = h1.user_id
                                                          AND h3.timestamp < h1.timestamp)
            -- h4: the earliest hit of the same user strictly after h1
            LEFT JOIN hit h4
                      ON h4.user_id = h1.user_id
                         AND h4.timestamp = (SELECT min(h5.timestamp)
                                                    FROM hit h5
                                                    WHERE h5.user_id = h1.user_id
                                                          AND h5.timestamp > h1.timestamp)
)
SELECT c1.user_id,
       c1.page,
       c1.timestamp,
       -- Session start: the latest hit at or before c1 that does not continue a session
       concat((SELECT concat(c2.user_id, '-', to_char(max(c2.timestamp), 'YYYYMMDDHH24MI'))
                      FROM cte c2
                      WHERE c2.user_id = c1.user_id
                            AND c2.timestamp <= c1.timestamp
                            AND NOT c2.shares_session_with_previous
                      GROUP BY c2.user_id),
              -- Session end: the earliest hit at or after c1 that is not followed within the session
              (SELECT to_char(min(c2.timestamp), 'HH24MI')
                      FROM cte c2
                      WHERE c2.user_id = c1.user_id
                            AND c2.timestamp >= c1.timestamp
                            AND NOT c2.shares_session_with_next)) AS session_id
       FROM cte c1
       ORDER BY c1.timestamp;

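The core part is the CTE. For each row, it joins the row with the latest earlier timestamp and the row with the earliest later timestamp for the same user. It then checks whether the interval between each of those neighboring timestamps and the row's own timestamp is less than or equal to 3600 seconds. The results of these checks are stored in the flags shares_session_with_previous and shares_session_with_next.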
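As an aside, the same two flags can probably also be computed with the window functions lag and lead instead of the self-joins. The following is a sketch of that alternative, not the query used in this answer. It assumes the same hit table, and the window name w is just an illustrative choice.

```sql
-- Alternative sketch: derive the flags with lag()/lead() instead of self-joins.
-- Assumes the same table "hit"; the window name "w" is illustrative.
SELECT user_id,
       page,
       "timestamp",
       -- lag() is NULL for a user's first hit, so coalesce yields false there
       coalesce(("timestamp" - lag("timestamp") OVER w) <= interval '3600 seconds',
                false) AS shares_session_with_previous,
       -- lead() is NULL for a user's last hit, so coalesce yields false there
       coalesce((lead("timestamp") OVER w - "timestamp") <= interval '3600 seconds',
                false) AS shares_session_with_next
FROM hit
WINDOW w AS (PARTITION BY user_id ORDER BY "timestamp");
```

Because no join is involved, this variant returns exactly one output row per row of hit, even if a (user_id, timestamp) pair occurs more than once.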

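The flags are then used to find the beginning and the end of the session. The beginning is the latest timestamp less than or equal to the current timestamp for which shares_session_with_previous is false. The end is the earliest timestamp greater than or equal to the current timestamp for which shares_session_with_next is false.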

The corresponding values for the session's beginning and end are then concatenated to form the session ID.
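For comparison, here is a sketch of how the whole task might be written with window functions only (the usual "gaps and islands" pattern): mark each row that starts a session, number the sessions with a running sum, then take the first and last timestamp of each session. The names flagged, numbered, is_session_start and session_no are made up for illustration, and the hit table is assumed as above.

```sql
-- Alternative sketch (gaps and islands), assuming the same table "hit".
-- CTE and column names here are illustrative, not part of the original answer.
WITH flagged AS (
    SELECT user_id,
           page,
           "timestamp",
           -- 1 when the row starts a new session: either there is no previous
           -- hit (lag is NULL) or the gap to it exceeds 3600 seconds
           CASE WHEN ("timestamp" - lag("timestamp") OVER w) <= interval '3600 seconds'
                THEN 0
                ELSE 1
           END AS is_session_start
    FROM hit
    WINDOW w AS (PARTITION BY user_id ORDER BY "timestamp")
),
numbered AS (
    SELECT user_id,
           page,
           "timestamp",
           -- running count of session starts = per-user session number
           sum(is_session_start) OVER (PARTITION BY user_id ORDER BY "timestamp") AS session_no
    FROM flagged
)
SELECT user_id,
       page,
       "timestamp",
       -- user_id, then the session's first timestamp, then the time of its last hit
       concat(user_id, '-',
              to_char(min("timestamp") OVER s, 'YYYYMMDDHH24MI'),
              to_char(max("timestamp") OVER s, 'HH24MI')) AS session_id
FROM numbered
WINDOW s AS (PARTITION BY user_id, session_no)
ORDER BY "timestamp";
```

For user 4544731 in the sample data, the hits from 04:56 to 05:26 get session_no 1 and the 06:40 hit gets session_no 2, which corresponds to the IDs 4544731-2018030104560526 and 4544731-2018030106400640 in the desired output.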
