I need to understand what this calculation means:
DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY udid
ORDER BY recday), recday)
and how to implement it for Amazon Redshift without using LAG and DATEDIFF.
Here is the full query:
SELECT udid
    ,recday AS day
    ,count(*) AS session_count
    ,DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction1
    ,DATEDIFF(days, lag(recday, 2) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction2
    ,DATEDIFF(days, lag(recday, 3) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction3
    ,DATEDIFF(days, lag(recday, 4) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction4
    ,DATEDIFF(days, lag(recday, 5) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction5
    ,DATEDIFF(days, lag(recday, 6) OVER (PARTITION BY udid ORDER BY recday), recday) AS repeat_transaction6
FROM vvdays
This is what my data looks like:
10000001 2016-02-03 17:26:03.0
10000001 2016-02-08 21:36:07.0
10000001 2016-02-10 07:48:06.0
10000012 2016-02-06 22:06:42.0
10000012 2016-02-06 22:07:42.0
10000028 2016-02-04 13:18:48.0
10000028 2016-02-04 13:30:42.0
10000028 2016-02-04 13:30:55.0
10000028 2016-02-05 16:48:41.0
10000028 2016-02-05 16:58:34.0
10000028 2016-02-07 15:44:33.0
10000028 2016-02-07 16:29:00.0
10000039 2016-02-03 21:16:49.0
10000039 2016-02-03 21:17:50.0
10000039 2016-02-03 21:18:49.0
10000039 2016-02-03 21:19:49.0
10000039 2016-02-03 21:20:50.0
10000039 2016-02-03 21:21:50.0
10000039 2016-02-03 21:22:51.0
10000039 2016-02-03 21:23:53.0
10000039 2016-02-03 21:24:49.0
10000039 2016-02-03 21:25:50.0
10000039 2016-02-03 21:26:50.0
10000039 2016-02-03 21:27:49.0
10000039 2016-02-05 21:58:59.0
10000039 2016-02-05 21:59:58.0
10000039 2016-02-05 22:00:58.0
10000039 2016-02-05 22:01:58.0
10000039 2016-02-05 22:02:59.0
10000039 2016-02-05 22:03:58.0
10000039 2016-02-05 22:05:00.0
10000039 2016-02-05 22:05:58.0
10000039 2016-02-05 22:06:58.0
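Answer 0 (score: 0)
Without seeing your data, I'm guessing your table 'vvdays' contains two fields, 'udid' and 'recday'. For each row, LAG(recday, n) returns the 'recday' from n rows earlier within the same udid (ordered by recday), so the six expressions look back 1 to 6 rows. DATEDIFF then compares that earlier 'recday' with the current row's 'recday' and returns the number of days between the two dates; where no earlier row exists, the result is NULL.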
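To make the meaning of the expression concrete, here is a minimal Python sketch of what DATEDIFF(days, LAG(recday, n) OVER (PARTITION BY udid ORDER BY recday), recday) works out to, run on a few rows taken from the question. The names SAMPLE_ROWS and lag_day_diffs are made up for this illustration, and the sketch assumes DATEDIFF in days counts calendar-day boundaries (so the time of day is ignored).

```python
from collections import defaultdict
from datetime import datetime

# A few (udid, recday) rows copied from the question's sample data.
SAMPLE_ROWS = [
    (10000001, "2016-02-03 17:26:03.0"),
    (10000001, "2016-02-08 21:36:07.0"),
    (10000001, "2016-02-10 07:48:06.0"),
    (10000012, "2016-02-06 22:06:42.0"),
    (10000012, "2016-02-06 22:07:42.0"),
]


def lag_day_diffs(rows, offset):
    """Hypothetical helper: emulate
    DATEDIFF(days, LAG(recday, offset) OVER (PARTITION BY udid ORDER BY recday), recday).
    """
    # PARTITION BY udid
    by_udid = defaultdict(list)
    for udid, recday in rows:
        by_udid[udid].append(datetime.strptime(recday, "%Y-%m-%d %H:%M:%S.%f"))

    result = []
    for udid in sorted(by_udid):
        ordered = sorted(by_udid[udid])  # ORDER BY recday
        for i, current in enumerate(ordered):
            if i < offset:
                diff = None  # LAG finds no row that far back, so the result is NULL
            else:
                # Assumption: only the calendar dates matter for a day-level DATEDIFF.
                diff = (current.date() - ordered[i - offset].date()).days
            result.append((udid, current, diff))
    return result


for udid, recday, diff in lag_day_diffs(SAMPLE_ROWS, 1):
    print(udid, recday, diff)
```

With an offset of 1 each row is compared with the previous visit of the same udid; with an offset of 2 it is compared with the visit before that, and so on up to 6 in the question's query.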
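How to replicate this in Redshift is another matter. You could look at using UNPIVOT to put the first 7 results onto the same row, then run a DATEDIFF-equivalent function on the fields themselves.
Edit: OK, I've come up with a really hacky way of doing this.
Create a temp table for testing: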
CREATE TABLE #vvdays (udid int, recday datetime)
Insert some data (edit: now using the data provided by the OP):
INSERT INTO #vvdays (udid, recday)
VALUES
('10000001', '2016-02-03 17:26:03.0')
,('10000001', '2016-02-08 21:36:07.0')
,('10000001', '2016-02-10 07:48:06.0')
,('10000012', '2016-02-06 22:06:42.0')
,('10000012', '2016-02-06 22:07:42.0')
,('10000028', '2016-02-04 13:18:48.0')
,('10000028', '2016-02-04 13:30:42.0')
,('10000028', '2016-02-04 13:30:55.0')
,('10000028', '2016-02-05 16:48:41.0')
,('10000028', '2016-02-05 16:58:34.0')
,('10000028', '2016-02-07 15:44:33.0')
,('10000028', '2016-02-07 16:29:00.0')
,('10000039', '2016-02-03 21:16:49.0')
,('10000039', '2016-02-03 21:17:50.0')
,('10000039', '2016-02-03 21:18:49.0')
,('10000039', '2016-02-03 21:19:49.0')
,('10000039', '2016-02-03 21:20:50.0')
,('10000039', '2016-02-03 21:21:50.0')
,('10000039', '2016-02-03 21:22:51.0')
,('10000039', '2016-02-03 21:23:53.0')
,('10000039', '2016-02-03 21:24:49.0')
,('10000039', '2016-02-03 21:25:50.0')
,('10000039', '2016-02-03 21:26:50.0')
,('10000039', '2016-02-03 21:27:49.0')
,('10000039', '2016-02-05 21:58:59.0')
,('10000039', '2016-02-05 21:59:58.0')
,('10000039', '2016-02-05 22:00:58.0')
,('10000039', '2016-02-05 22:01:58.0')
,('10000039', '2016-02-05 22:02:59.0')
,('10000039', '2016-02-05 22:03:58.0')
,('10000039', '2016-02-05 22:05:00.0')
,('10000039', '2016-02-05 22:05:58.0')
,('10000039', '2016-02-05 22:06:58.0')
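Getting this to work turned out to be pretty horrible. Because of the restrictions you mentioned and my lack of Amazon-specific knowledge, I've only done the first two values for you below. If you do it this way you will end up with a massive statement, but it will work. I'd strongly suggest researching further to see which equivalent functions are available to you: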
SELECT day1.udid
    ,MAX(day1.recday) day1
    ,MAX(day2.recday) day2
    ,DATEDIFF(day,MAX(day2.recday),MAX(day1.recday)) Day2Diff
    ,MAX(day3.recday) day3
    ,DATEDIFF(day,MAX(day3.recday),MAX(day1.recday)) Day3Diff
FROM #vvdays day1
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
    ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
) day2 ON day1.udid = day2.udid
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
    ) b ON a.udid = b.udid
    LEFT JOIN (
        SELECT a.udid
            ,MAX(a.recday) recday
        FROM #vvdays a
        LEFT JOIN (
            SELECT udid
                ,MAX(recday) recday
            FROM #vvdays
            GROUP BY udid
        ) b ON a.udid = b.udid
        WHERE a.recday <> b.recday
        GROUP BY a.udid
    ) day2 ON a.udid = day2.udid
    WHERE a.recday NOT IN (b.recday, day2.recday)
    GROUP BY a.udid
) day3 ON day1.udid = day3.udid
GROUP BY day1.udid
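The reason I use MAX on 'day1' is to return the first date in the chain, i.e. the most recent 'recday' for each udid. I use it on 'day2' at the top level purely to turn it into an aggregate field so that you get a single result per udid; it is a dummy aggregate that lets the GROUP BY work.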
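A possibly shorter route than nesting one subquery per offset is to number the rows of each udid with a correlated COUNT(*) and then self-join on that number, which gives one difference per row, as LAG does. The sketch below is only an illustration under stated assumptions: it runs the SQL in SQLite through Python's sqlite3 module rather than in Redshift, the day arithmetic uses SQLite's julianday function (Redshift would need its own date subtraction), and it assumes 'recday' is unique within a udid, as it is in the sample data. The names SAMPLE_ROWS, QUERY and lag_day_diffs_sql are invented for the example.

```python
import sqlite3

# A few (udid, recday) rows copied from the question's sample data.
SAMPLE_ROWS = [
    (10000001, "2016-02-03 17:26:03.0"),
    (10000001, "2016-02-08 21:36:07.0"),
    (10000001, "2016-02-10 07:48:06.0"),
    (10000012, "2016-02-06 22:06:42.0"),
    (10000012, "2016-02-06 22:07:42.0"),
]

# rn = number of earlier rows for the same udid, i.e. a 0-based row number.
# Assumption: recday is unique per udid; ties would receive the same rn.
QUERY = """
WITH ranked AS (
    SELECT a.udid,
           a.recday,
           (SELECT COUNT(*)
              FROM vvdays b
             WHERE b.udid = a.udid
               AND b.recday < a.recday) AS rn
      FROM vvdays a
)
SELECT cur.udid,
       cur.recday,
       -- SQLite-specific day arithmetic on the date part (first 10 characters).
       CAST(julianday(substr(cur.recday, 1, 10))
            - julianday(substr(prev.recday, 1, 10)) AS INTEGER) AS day_diff
  FROM ranked cur
  LEFT JOIN ranked prev
    ON prev.udid = cur.udid
   AND prev.rn = cur.rn - :offset
 ORDER BY cur.udid, cur.recday
"""


def lag_day_diffs_sql(rows, offset):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE vvdays (udid INTEGER, recday TEXT)")
        conn.executemany("INSERT INTO vvdays (udid, recday) VALUES (?, ?)", rows)
        return conn.execute(QUERY, {"offset": offset}).fetchall()
    finally:
        conn.close()


for row in lag_day_diffs_sql(SAMPLE_ROWS, 1):
    print(row)
```

Each additional repeat_transaction column would then be one more LEFT JOIN on the same ranked set with a different offset, instead of another level of nesting. The correlated count is quadratic within each udid, so whether this is acceptable on a large table is something to check.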