Understanding the meaning of DATEDIFF + LAG

Time: 2016-04-21 07:55:48

Tags: sql amazon-redshift

I need to understand what this calculation means:

DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 

I would also like to know how to implement it on Amazon Redshift without using lag and datediff.

Here is the full query:

SELECT udid
      ,recday AS day
      ,count(*) AS session_count
      ,DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction1
      ,DATEDIFF(days, lag(recday, 2) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction2
      ,DATEDIFF(days, lag(recday, 3) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction3
      ,DATEDIFF(days, lag(recday, 4) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction4
      ,DATEDIFF(days, lag(recday, 5) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction5
      ,DATEDIFF(days, lag(recday, 6) OVER (PARTITION BY udid
                                           ORDER BY recday), recday)
         AS repeat_transaction6
FROM   vvdays

This is what my data looks like:

10000001  2016-02-03 17:26:03.0
10000001  2016-02-08 21:36:07.0
10000001  2016-02-10 07:48:06.0
10000012  2016-02-06 22:06:42.0
10000012  2016-02-06 22:07:42.0
10000028  2016-02-04 13:18:48.0
10000028  2016-02-04 13:30:42.0
10000028  2016-02-04 13:30:55.0
10000028  2016-02-05 16:48:41.0
10000028  2016-02-05 16:58:34.0
10000028  2016-02-07 15:44:33.0
10000028  2016-02-07 16:29:00.0
10000039  2016-02-03 21:16:49.0
10000039  2016-02-03 21:17:50.0
10000039  2016-02-03 21:18:49.0
10000039  2016-02-03 21:19:49.0
10000039  2016-02-03 21:20:50.0
10000039  2016-02-03 21:21:50.0
10000039  2016-02-03 21:22:51.0
10000039  2016-02-03 21:23:53.0
10000039  2016-02-03 21:24:49.0
10000039  2016-02-03 21:25:50.0
10000039  2016-02-03 21:26:50.0
10000039  2016-02-03 21:27:49.0
10000039  2016-02-05 21:58:59.0
10000039  2016-02-05 21:59:58.0
10000039  2016-02-05 22:00:58.0
10000039  2016-02-05 22:01:58.0
10000039  2016-02-05 22:02:59.0
10000039  2016-02-05 22:03:58.0
10000039  2016-02-05 22:05:00.0
10000039  2016-02-05 22:05:58.0
10000039  2016-02-05 22:06:58.0

1 answer:

Answer 0: (score: 0)

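Without seeing your data, I'm guessing your table 'vvdays' contains two fields, 'udid' and 'recday'. For each row, LAG(recday, n) looks back n rows within the same udid (ordered by recday), so the six LAG calls fetch the 'recday' from one to six rows earlier and return NULL where no such row exists. DATEDIFF then compares the current row's 'recday' with each of those earlier values and returns the number of days between the two dates.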
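To make that concrete, here is a small sketch using the three rows the OP posted for udid 10000001. It is written in Redshift-style SQL and assumes LAG and DATEDIFF are available; the CTE name sample_days and the output column names are made up for the illustration.

```sql
-- Hypothetical inline sample: the three rows posted for udid 10000001.
WITH sample_days AS (
    SELECT 10000001 AS udid, CAST('2016-02-03 17:26:03' AS timestamp) AS recday
    UNION ALL
    SELECT 10000001, CAST('2016-02-08 21:36:07' AS timestamp)
    UNION ALL
    SELECT 10000001, CAST('2016-02-10 07:48:06' AS timestamp)
)
SELECT udid
      ,recday
       -- recday of the previous row for the same udid (NULL on the first row)
      ,LAG(recday, 1) OVER (PARTITION BY udid ORDER BY recday) AS prev_recday
       -- whole days between the previous visit and this one
      ,DATEDIFF(day, LAG(recday, 1) OVER (PARTITION BY udid ORDER BY recday), recday)
         AS repeat_transaction1
       -- whole days between the visit two rows back and this one
      ,DATEDIFF(day, LAG(recday, 2) OVER (PARTITION BY udid ORDER BY recday), recday)
         AS repeat_transaction2
FROM sample_days
ORDER BY udid, recday;

-- Expected result, since DATEDIFF(day, ...) counts calendar-day boundaries crossed:
--   2016-02-03 17:26:03 | prev NULL                | NULL | NULL
--   2016-02-08 21:36:07 | prev 2016-02-03 17:26:03 | 5    | NULL
--   2016-02-10 07:48:06 | prev 2016-02-08 21:36:07 | 2    | 7
```

So repeat_transaction1 is the gap in days to the previous record, repeat_transaction2 the gap to the record before that, and so on up to six records back.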

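How to replicate that in Redshift is another matter. You could look at using UNPIVOT to get the first 7 results onto the same row, and then run a DATEDIFF-equivalent function on the fields themselves.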

Edit: OK, I've come up with a really hacky way of getting this to work.

Create a temp table for testing:

CREATE TABLE #vvdays (udid int, recday datetime)

Insert some data (edit: now using the data provided by the OP):

INSERT INTO #vvdays (udid, recday)
VALUES 
('10000001', '2016-02-03 17:26:03.0') 
,('10000001', '2016-02-08 21:36:07.0') 
,('10000001', '2016-02-10 07:48:06.0') 
,('10000012', '2016-02-06 22:06:42.0') 
,('10000012', '2016-02-06 22:07:42.0') 
,('10000028', '2016-02-04 13:18:48.0') 
,('10000028', '2016-02-04 13:30:42.0') 
,('10000028', '2016-02-04 13:30:55.0') 
,('10000028', '2016-02-05 16:48:41.0') 
,('10000028', '2016-02-05 16:58:34.0') 
,('10000028', '2016-02-07 15:44:33.0') 
,('10000028', '2016-02-07 16:29:00.0') 
,('10000039', '2016-02-03 21:16:49.0') 
,('10000039', '2016-02-03 21:17:50.0') 
,('10000039', '2016-02-03 21:18:49.0') 
,('10000039', '2016-02-03 21:19:49.0') 
,('10000039', '2016-02-03 21:20:50.0') 
,('10000039', '2016-02-03 21:21:50.0') 
,('10000039', '2016-02-03 21:22:51.0') 
,('10000039', '2016-02-03 21:23:53.0') 
,('10000039', '2016-02-03 21:24:49.0') 
,('10000039', '2016-02-03 21:25:50.0') 
,('10000039', '2016-02-03 21:26:50.0') 
,('10000039', '2016-02-03 21:27:49.0') 
,('10000039', '2016-02-05 21:58:59.0') 
,('10000039', '2016-02-05 21:59:58.0') 
,('10000039', '2016-02-05 22:00:58.0') 
,('10000039', '2016-02-05 22:01:58.0') 
,('10000039', '2016-02-05 22:02:59.0') 
,('10000039', '2016-02-05 22:03:58.0') 
,('10000039', '2016-02-05 22:05:00.0') 
,('10000039', '2016-02-05 22:05:58.0') 
,('10000039', '2016-02-05 22:06:58.0')

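Getting this to work was pretty horrible. Given the restrictions you mentioned and my lack of Amazon-specific knowledge, I've done the first two values for you below. If you do it this way you will end up with a massive statement, but it will work. I'd strongly recommend researching further to see which equivalent functions you have available: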

SELECT day1.udid
    ,MAX(day1.recday) day1
    ,MAX(day2.recday) day2
    ,DATEDIFF(day,MAX(day2.recday),MAX(day1.recday)) Day2Diff
    ,MAX(day3.recday) day3
    ,DATEDIFF(day,MAX(day3.recday),MAX(day1.recday)) Day3Diff
FROM #vvdays day1
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON day1.udid = day2.udid
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    LEFT JOIN (
        SELECT a.udid
            ,MAX(a.recday) recday
        FROM #vvdays a
        LEFT JOIN (
            SELECT udid
                ,MAX(recday) recday
            FROM #vvdays
            GROUP BY udid
            ) b ON a.udid = b.udid
        WHERE a.recday <> b.recday
        GROUP BY a.udid
        ) day2 ON a.udid = day2.udid
    WHERE a.recday NOT IN (b.recday, day2.recday)
    GROUP BY a.udid
    ) day3 ON day1.udid = day3.udid
GROUP BY day1.udid

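The reason I'm using MAX on 'day1' is to return the most recent date for each udid, which is the first date in the comparison. At the top level I'm using it on 'day2' (and 'day3') purely to turn it into an aggregated field. You will only ever get one result there, so it's a bit of a false aggregation to get the GROUP BY to work properly.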
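For comparison, here is a rough sketch of a per-row version that avoids both LAG and DATEDIFF, covering only the first two offsets. It rests on several assumptions that are not confirmed anywhere in this thread: each (udid, recday) pair is unique, common table expressions are available, and subtracting one DATE from another yields an integer number of days (PostgreSQL-style behaviour). The names ranked, rn, cur, p1 and p2 are invented for the example.

```sql
-- Number each row within its udid by counting the rows at or before it.
-- Assumes (udid, recday) is unique, so rn runs 1, 2, 3, ... per udid.
WITH ranked AS (
    SELECT a.udid
          ,a.recday
          ,COUNT(*) AS rn
    FROM vvdays a
    JOIN vvdays b
      ON b.udid = a.udid
     AND b.recday <= a.recday
    GROUP BY a.udid, a.recday
)
SELECT cur.udid
      ,cur.recday
       -- stand-in for DATEDIFF(day, LAG(recday, 1) ..., recday);
       -- assumes DATE - DATE returns whole days
      ,CAST(cur.recday AS date) - CAST(p1.recday AS date) AS repeat_transaction1
       -- stand-in for DATEDIFF(day, LAG(recday, 2) ..., recday)
      ,CAST(cur.recday AS date) - CAST(p2.recday AS date) AS repeat_transaction2
FROM ranked cur
LEFT JOIN ranked p1
  ON p1.udid = cur.udid
 AND p1.rn = cur.rn - 1   -- one row back
LEFT JOIN ranked p2
  ON p2.udid = cur.udid
 AND p2.rn = cur.rn - 2   -- two rows back
ORDER BY cur.udid, cur.recday;
```

Each further offset needs one more LEFT JOIN on rn = cur.rn - n. The counting self-join compares every row with every earlier row for the same udid, so it can get slow on large tables. Redshift's documentation does list both LAG and DATEDIFF, so it may be worth trying the original query there before falling back to something like this.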