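I am creating session IDs for clickstream data. A new session ID is created and assigned whenever a user has been inactive for more than 30 minutes, that is, whenever the time difference between consecutive records in that user's click chain exceeds 30 minutes.
So far I have been able to create a brand-new table that carries the new session ID as an extra column alongside the data from the main table.
This is a computationally heavy query and it takes up extra storage, because it creates a whole new table while the main table still exists. After the new table is created, I have to drop the main table.
Is it possible to assign the session IDs and complete the whole process without creating a new table? The optimized query needs to work in Redshift (PostgreSQL).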
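CREATE TABLE <new_table_name> AS
SELECT *,
       -- running count of session starts per user, prefixed with the user id
       userid || '_' || SUM(session) OVER (PARTITION BY userid ORDER BY date ROWS UNBOUNDED PRECEDING) AS session_id
FROM (
    SELECT *,
           CASE
               -- a gap of 30 minutes or more since the user's previous record starts a new session
               WHEN EXTRACT(EPOCH FROM date) - LAG(EXTRACT(EPOCH FROM date)) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60
                   THEN 1
               -- the user's first record always starts a session
               WHEN ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) = 1
                   THEN 1
               ELSE 0
           END AS session
    FROM <table_name>
) AS flagged;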
Answer 0 (score: 0)
How about this (after adding a session column)?
UPDATE <table_name>
SET session = s.session
FROM (
    -- the column is named userid, as in the question
    SELECT userid, date,
           CASE
               WHEN EXTRACT(EPOCH FROM date) - LAG(EXTRACT(EPOCH FROM date))
                        OVER (PARTITION BY userid ORDER BY date) >= 30 * 60 THEN 1
               WHEN ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) = 1 THEN 1
               ELSE 0
           END AS session
    FROM <table_name>
) s
WHERE <table_name>.userid = s.userid
  AND <table_name>.date = s.date;
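
The update above only fills in the 0/1 session-start flag, not the final session ID. A rough sketch of writing the session ID itself in one statement could look like the following. It is untested and rests on a few assumptions: the table is called clickstream (a hypothetical name standing in for the main table), the session_id column and its VARCHAR(64) size are my own choices, and (userid, date) is specific enough to match each row back to itself.

```sql
-- Assumption: clickstream is a hypothetical name for the main table,
-- with the userid and date columns used in the question.
ALTER TABLE clickstream ADD COLUMN session_id VARCHAR(64);

UPDATE clickstream
SET session_id = s.session_id
FROM (
    SELECT userid, date,
           -- the running sum of session starts gives each user's session number
           CAST(userid AS VARCHAR) || '_' ||
           CAST(SUM(new_session) OVER (PARTITION BY userid ORDER BY date
                                       ROWS UNBOUNDED PRECEDING) AS VARCHAR) AS session_id
    FROM (
        SELECT userid, date,
               CASE
                   -- 30 minutes or more since the user's previous record
                   WHEN EXTRACT(EPOCH FROM date) - LAG(EXTRACT(EPOCH FROM date))
                            OVER (PARTITION BY userid ORDER BY date) >= 30 * 60 THEN 1
                   -- the user's first record always opens a session
                   WHEN ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) = 1 THEN 1
                   ELSE 0
               END AS new_session
        FROM clickstream
    ) flagged
) s
WHERE clickstream.userid = s.userid
  AND clickstream.date = s.date;
```

One caveat: as far as I know, Redshift carries out an UPDATE by marking the old row versions as deleted and writing new ones, so updating every row still uses extra space until the table is vacuumed. This avoids the second table, but it may not end up cheaper than the CREATE TABLE AS approach.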