Creating date ranges from multiple rows based on a single date

Time: 2020-09-27 05:40:43

Tags: sql amazon-redshift date-range gaps-and-islands

I have a user table with the following fields: User_ID, Email, Used_date.

Original Table

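As we can see, a user can switch between multiple emails over time. I want to create date range fields (Email_Start_Date and Email_End_Date) from the used_date field. They will store the period during which the user used that email.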

Expected Table

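A user may switch back to an old email. In that case, the same email will have two date ranges.

I also want to fill the gap between the last day of the previous email and the start date of the current email.

For example, suppose a user used someone@gmail.com from 8/28/2020 to 8/31/2020.

He then switched to someone1@gmail.com on 9/3/2020.

Then in the output, the date range for someone@gmail.com would be 8/28/2020 to 9/2/2020.

This is a gaps-and-islands problem, but I don't know how to implement it.

Thanks, everyone!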

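2 answers:

Answer 0 (score: 0)

Next time, paste your data as text so that we don't have to retype it...

Is this what you mean? I prefer an "infinity date" to a NULL value for the last end date, and I prefer "session ID" to "island identifier", as these are usually called in clickstream and IoT analytics...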

WITH
indata(userid,email,used_dt) AS (
          SELECT 1,'someone@gmail.com' , DATE '2020-08-28'
UNION ALL SELECT 1,'someone@gmail.com' , DATE '2020-08-29'
UNION ALL SELECT 1,'someone@gmail.com' , DATE '2020-08-30'
UNION ALL SELECT 1,'someone@gmail.com' , DATE '2020-08-31'
UNION ALL SELECT 1,'someone1@gmail.com', DATE '2020-09-03'
UNION ALL SELECT 1,'someone1@gmail.com', DATE '2020-09-05'
UNION ALL SELECT 1,'someone1@gmail.com', DATE '2020-09-07'
UNION ALL SELECT 1,'someone@gmail.com',  DATE '2020-09-09'
UNION ALL SELECT 2,'bob@gmail.com'     , DATE '2019-07-12'
UNION ALL SELECT 3,'alice@newmail.com' , DATE '2020-08-08'
)
,
with_change_counter AS (
SELECT 
  userid
, email
, used_dt AS used_from_dt
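, CASE 
    -- 1 when the email differs from the previous row (or on a user's first row).
    -- COALESCE supplies the default because the three-argument
    -- LAG(expr, offset, default) form is not accepted by every engine
    -- (Amazon Redshift documents LAG/LEAD without a default argument).
    WHEN COALESCE(LAG(email) OVER(
      PARTITION BY userid ORDER BY used_dt
    ), '') <> email 
    THEN 1
    ELSE 0 
  END AS counter
  -- next usage date for the same user; "infinity date" on the last row
, COALESCE(LEAD(used_dt) OVER(
    PARTITION BY userid ORDER BY used_dt
  ), DATE '9999-12-31') AS used_until_dt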
  FROM indata
)
-- running sum of the change flags: rows of one uninterrupted email run share a sessid
,with_sess_id AS (
  SELECT
    userid
  , email
  , used_from_dt
  , used_until_dt
  , SUM(counter) OVER(PARTITION BY userid ORDER BY used_from_dt) AS sessid
  , counter
  FROM with_change_counter
) 
SELECT
  userid
, MAX(email) AS email
, MIN(used_from_dt) AS email_start_date
, MAX(used_until_dt) AS email_end_date
FROM with_sess_id
GROUP BY
  sessid
, userid
ORDER BY
  userid
, sessid
, email
;
-- out  userid |       email        | email_start_date | email_end_date 
-- out --------+--------------------+------------------+----------------
-- out       1 | someone@gmail.com  | 2020-08-28       | 2020-09-03
-- out       1 | someone1@gmail.com | 2020-09-03       | 2020-09-09
-- out       1 | someone@gmail.com  | 2020-09-09       | 9999-12-31
-- out       2 | bob@gmail.com      | 2019-07-12       | 9999-12-31
-- out       3 | alice@newmail.com  | 2020-08-08       | 9999-12-31
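
In the output above, each email_end_date equals the next email's start date (2020-09-03), whereas the question asks for the day before it (2020-09-02). A minimal sketch of that variant follows, under some loud assumptions: it runs the same change-counter and running-sum idea in SQLite through Python's standard sqlite3 module rather than in Redshift, stores dates as ISO text, and uses SQLite's date(x, '-1 day') where Redshift would need its own date arithmetic. The function name inclusive_ranges is hypothetical.

```python
import sqlite3

# Sample rows copied from the answer above; dates are stored as ISO text for SQLite.
ROWS = [
    (1, "someone@gmail.com", "2020-08-28"),
    (1, "someone@gmail.com", "2020-08-29"),
    (1, "someone@gmail.com", "2020-08-30"),
    (1, "someone@gmail.com", "2020-08-31"),
    (1, "someone1@gmail.com", "2020-09-03"),
    (1, "someone1@gmail.com", "2020-09-05"),
    (1, "someone1@gmail.com", "2020-09-07"),
    (1, "someone@gmail.com", "2020-09-09"),
    (2, "bob@gmail.com", "2019-07-12"),
    (3, "alice@newmail.com", "2020-08-08"),
]

QUERY = """
WITH with_change_counter AS (
  SELECT userid, email, used_dt,
         CASE WHEN LAG(email, 1, '') OVER (PARTITION BY userid ORDER BY used_dt) <> email
              THEN 1 ELSE 0 END AS counter
  FROM indata
),
with_sess_id AS (
  SELECT userid, email, used_dt,
         SUM(counter) OVER (PARTITION BY userid ORDER BY used_dt) AS sessid
  FROM with_change_counter
),
sessions AS (
  SELECT userid, sessid, MAX(email) AS email, MIN(used_dt) AS email_start_date
  FROM with_sess_id
  GROUP BY userid, sessid
)
SELECT userid, email, email_start_date,
       -- day before the next session starts; "infinity date" for the last session
       COALESCE(
         date(LEAD(email_start_date) OVER (PARTITION BY userid ORDER BY email_start_date),
              '-1 day'),
         '9999-12-31') AS email_end_date
FROM sessions
ORDER BY userid, email_start_date
"""


def inclusive_ranges():
    """Return (userid, email, start, end) rows with inclusive end dates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE indata (userid INTEGER, email TEXT, used_dt TEXT)")
    conn.executemany("INSERT INTO indata VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in inclusive_ranges():
        print(row)
```

With the sample data, the first range for someone@gmail.com then runs from 2020-08-28 to 2020-09-02, which matches the example in the question. Window functions require SQLite 3.25 or later.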

Answer 1 (score: 0)

I would just suggest the difference of row numbers plus aggregation:

select user_id, email, min(used_date) as email_start_date,
       lead(min(used_date)) over (partition by user_id order by min(used_date)) - interval '1 day' as email_end_date
from (select t.*,
             row_number() over (partition by user_id order by used_date) as seqnum,
             row_number() over (partition by user_id, email order by used_date) as seqnum_2
      from t
     ) t
group by user_id, email, (seqnum - seqnum_2);
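
To see why grouping by (seqnum - seqnum_2) works, it can help to look at the difference row by row. The sketch below is an illustration only, with assumptions stated up front: it uses SQLite through Python's sqlite3 module instead of Redshift, it reuses the user 1 sample rows from the first answer, and the function name island_keys and the column alias grp are hypothetical.

```python
import sqlite3

# Sample rows for one user, mirroring the data used in the first answer.
ROWS = [
    (1, "someone@gmail.com", "2020-08-28"),
    (1, "someone@gmail.com", "2020-08-29"),
    (1, "someone@gmail.com", "2020-08-30"),
    (1, "someone@gmail.com", "2020-08-31"),
    (1, "someone1@gmail.com", "2020-09-03"),
    (1, "someone1@gmail.com", "2020-09-05"),
    (1, "someone1@gmail.com", "2020-09-07"),
    (1, "someone@gmail.com", "2020-09-09"),
]

# grp is the difference of the two row numbers used in the query above.
QUERY = """
SELECT email, used_date,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY used_date)
     - ROW_NUMBER() OVER (PARTITION BY user_id, email ORDER BY used_date) AS grp
FROM t
ORDER BY user_id, used_date
"""


def island_keys():
    """Return (email, used_date, grp) for every row, ordered by date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (user_id INTEGER, email TEXT, used_date TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for email, used_date, grp in island_keys():
        print(email, used_date, grp)
```

For this data the difference is 0 on the first four someone@gmail.com rows, 4 on the three someone1@gmail.com rows, and 3 on the final someone@gmail.com row. The value stays constant within one uninterrupted run of an email and changes when the user returns to that email later. The difference by itself is not guaranteed to be unique across emails, which is why the query also groups by email.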

Actually, you can also use lag() with no aggregation:

select user_id, email, used_date as email_start_date,
       lead(used_date) over (partition by user_id order by used_date) - interval '1 day' as email_end_date
from (select t.*,
             lag(email) over (partition by user_id order by used_date) as prev_email
      from t
     ) t
where prev_email is null or prev_email <> email;

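The second one is quite simple. It just keeps the rows where the email changes (or where a user's data starts). Then it uses lead() to get the end date.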
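
The filter-then-lead() idea can be tried out on a small data set. The following is a sketch under assumptions, not the answer's exact query: it runs in SQLite via Python's sqlite3 module, it selects the plain used_date of each change row as the start date, it replaces the interval arithmetic with SQLite's date(x, '-1 day'), and the sample rows and the function name change_rows are made up for illustration.

```python
import sqlite3

# Illustrative rows: user 1 switches emails twice, user 2 never switches.
ROWS = [
    (1, "someone@gmail.com", "2020-08-28"),
    (1, "someone@gmail.com", "2020-08-31"),
    (1, "someone1@gmail.com", "2020-09-03"),
    (1, "someone1@gmail.com", "2020-09-07"),
    (1, "someone@gmail.com", "2020-09-09"),
    (2, "bob@gmail.com", "2019-07-12"),
]

# The WHERE clause runs before the outer LEAD(), so LEAD() only sees the
# rows where the email changed and returns the next change date.
QUERY = """
SELECT user_id, email, used_date AS email_start_date,
       date(LEAD(used_date) OVER (PARTITION BY user_id ORDER BY used_date),
            '-1 day') AS email_end_date
FROM (SELECT t.*,
             LAG(email) OVER (PARTITION BY user_id ORDER BY used_date) AS prev_email
      FROM t
     ) x
WHERE prev_email IS NULL OR prev_email <> email
ORDER BY user_id, used_date
"""


def change_rows():
    """Return one (user_id, email, start, end) row per email run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (user_id INTEGER, email TEXT, used_date TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in change_rows():
        print(row)
```

The last range of each user has a NULL end date here (None in Python), because there is no later change row for lead() to read. Wrapping the expression in COALESCE would give an open-ended sentinel date instead, if that is preferred.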

Here is a db<>fiddle.