Find the start date of a new project

Time: 2018-12-08 18:01:57

Tags: sql amazon-redshift

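If you are interested in a challenging SQL problem, read on:

For the data set below, I am trying to find logic that identifies the start date of each new project for every employee.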

Data Set

[image: input data set]

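The logic for determining the start date of a new project is:

  1. The employee has no record dated within the 14 days before the current date.

  2. A project window lasts only 14 days from its start. The first record that falls outside that window counts as the start of the next project.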
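
The two rules above amount to a single greedy pass over each employee's sorted dates. Below is a minimal Python sketch of that reading, meant only as a way to reason about the rule, not as a Redshift solution. The helper name `project_starts` and the `window_days` parameter are hypothetical, and it assumes that a record exactly 14 days after a project start already falls outside the window.

```python
from datetime import date, timedelta

def project_starts(dates, window_days=14):
    """Return the dates that open a new project window (hypothetical helper)."""
    starts = []
    window_end = None  # first day that no longer belongs to the current window
    for d in sorted(set(dates)):
        # Assumption: a record exactly window_days after the start is outside the window.
        if window_end is None or d >= window_end:
            starts.append(d)
            window_end = d + timedelta(days=window_days)
    return starts

# A few illustrative dates for one employee.
sample = [date(2018, 1, 1), date(2018, 1, 3), date(2018, 1, 14),
          date(2018, 1, 16), date(2018, 1, 29)]
print(project_starts(sample))
# [datetime.date(2018, 1, 1), datetime.date(2018, 1, 16)]
```

Because each window is anchored at the start of the current project rather than at the previous row, a plain `LAG` comparison between neighbouring rows is not enough; the scan has to carry the current start date forward.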

What is needed

[image: expected result]

Both Redshift and Postgres solutions are accepted.

Note that Redshift does not support recursive CTEs or the RANGE keyword in window frames.

Thanks for reading.

1 Answer:

Answer 0: (score: 0)

For PostgreSQL, with the data set included as a CTE (DataSet), you can proceed as follows:

WITH RECURSIVE TimeLine(Employee, ProjectID, ProjectStartDate, Date, DateRank) AS (
    SELECT Employee, 1, Date, Date, DateRank
    FROM DataSetWithRank
    WHERE DateRank = 1
    UNION ALL
    SELECT T.Employee,
           T.ProjectID + CASE WHEN D.Date >= T.ProjectStartDate+14 THEN 1 ELSE 0 END,
           CASE WHEN D.Date >= T.ProjectStartDate+14 THEN D.Date ELSE T.ProjectStartDate END,
           D.Date, D.DateRank
    FROM TimeLine T
    JOIN DataSetWithRank D ON D.Employee = T.Employee AND D.DateRank = T.DateRank + 1
), DataSet(Employee,Date) AS (
SELECT UNNEST(ARRAY['Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1']),
    UNNEST(ARRAY['2018-01-01','2018-01-03','2018-01-05','2018-01-08','2018-01-11','2018-01-13','2018-01-14','2018-01-16','2018-01-18','2018-01-21','2018-01-22','2018-01-24','2018-01-25','2018-01-27','2018-01-29']::date[])
UNION
SELECT UNNEST(ARRAY['Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2']),
    UNNEST(ARRAY['2018-01-03','2018-01-05','2018-01-07','2018-01-10','2018-01-13','2018-01-15','2018-01-16','2018-01-18','2018-01-20','2018-01-23','2018-01-24','2018-01-26','2018-01-27','2018-01-29','2018-01-31']::date[])
), DataSetWithRank AS (
SELECT *, DENSE_RANK() OVER (PARTITION BY Employee ORDER BY Date) AS DateRank
FROM DataSet
)
SELECT Employee,
       'Project ' || ProjectID AS "Project #",
       Date,
       DENSE_RANK() OVER (PARTITION BY Employee, ProjectID ORDER BY Date) AS Rank,
       CASE WHEN Date = ProjectStartDate THEN 'Y' ELSE NULL END AS Is_New
FROM TimeLine
ORDER BY Employee, Date;
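
To cross-check what the query should return, here is a small Python sketch that walks the same sample data row by row, carrying the project start date forward the way the recursive CTE does. The function name `label_rows` is hypothetical, and the sketch assumes the same boundary as the SQL (`Date >= ProjectStartDate + 14` opens a new project).

```python
from datetime import date, timedelta
from itertools import groupby

def label_rows(records, window_days=14):
    """Return (employee, project_id, date, rank, is_new) tuples (hypothetical helper)."""
    rows = []
    ordered = sorted(set(records))  # sort by employee, then by date
    for employee, group in groupby(ordered, key=lambda r: r[0]):
        project_id, start, rank = 0, None, 0
        for _, d in group:
            # Same test as the CASE expressions in the recursive step.
            if start is None or d >= start + timedelta(days=window_days):
                project_id, start, rank = project_id + 1, d, 0
            rank += 1
            rows.append((employee, project_id, d, rank, "Y" if d == start else None))
    return rows

# Sample data from the answer, written as days of January 2018.
days = {
    "Employee1": [1, 3, 5, 8, 11, 13, 14, 16, 18, 21, 22, 24, 25, 27, 29],
    "Employee2": [3, 5, 7, 10, 13, 15, 16, 18, 20, 23, 24, 26, 27, 29, 31],
}
records = [(e, date(2018, 1, n)) for e, ns in days.items() for n in ns]
rows = label_rows(records)
for employee, project_id, d, rank, is_new in rows:
    if is_new:
        print(employee, project_id, d)
# Employee1 1 2018-01-01
# Employee1 2 2018-01-16
# Employee2 1 2018-01-03
# Employee2 2 2018-01-18
```

Under that boundary assumption, each employee in the sample ends up with two projects.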