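For a week now I have been trying to find a solution to the following Redshift puzzle (I think I have become obsessed with it):
There is an event table in Redshift ("event_user_item") in which users trigger events for certain items by entering an item code, which appears in the event_value column.
A failed submission consists of the event_type sequence PageLoad-ItemCode-ErrorResponse, but these event types are not necessarily consecutive: each user_id can have many other event types in between them.
Below is a small excerpt for 3 different user_ids that should illustrate the relevant scenarios for failed submissions.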
ord_num  event_type        event_value   user_id  event_datetime
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
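To experiment with the excerpt above, it can be loaded into a scratch table. This is only a sketch under assumptions: the column types are guesses (the question shows values, not the schema), the dates are read as DD/MM/YYYY, empty event_value cells are stored as NULL, and the table is created as a temporary table so that the real event_user_item is left untouched.

```sql
-- Assumed schema: the original post does not state the column types.
CREATE TEMPORARY TABLE event_user_item (
    ord_num        INTEGER,
    event_type     VARCHAR(32),
    event_value    VARCHAR(64),
    user_id        INTEGER,
    event_datetime TIMESTAMP
);

-- Rows copied from the excerpt; dates interpreted as DD/MM/YYYY (assumption).
INSERT INTO event_user_item VALUES
    (1,  'PageLoad',         NULL,           124, '2018-09-03 21:48:39'),
    (2,  'ItemCode',         'LG56731',      124, '2018-09-03 21:48:53'),
    (4,  'Details1PageLoad', NULL,           124, '2018-09-03 21:48:56'),
    (8,  'PageLoad',         NULL,           124, '2018-09-03 22:02:23'),
    (9,  'ItemCode',         'GU07019',      124, '2018-09-03 22:02:32'),
    (10, 'ErrorResponse',    'Some message', 124, '2018-09-03 22:02:32'),
    (51, 'PageLoad',         NULL,           228, '2018-09-04 12:38:30'),
    (52, 'ItemCode',         'EQ23487',      228, '2018-09-04 12:38:33'),
    (53, 'ErrorResponse',    'Some message', 228, '2018-09-04 12:38:34'),
    (54, 'PageLoad',         NULL,           304, '2018-09-04 15:43:14'),
    (55, 'ItemCode',         'OB68102',      304, '2018-09-04 15:43:57'),
    (56, 'ErrorResponse',    'Some message', 304, '2018-09-04 15:43:58'),
    (57, 'ItemCode',         'PB68102',      304, '2018-09-04 15:44:21'),
    (58, 'ErrorResponse',    'Some message', 304, '2018-09-04 15:44:22'),
    (59, 'PageLoad',         NULL,           304, '2018-09-05 11:19:37'),
    (60, 'ItemCode',         'OB68102',      304, '2018-09-05 11:20:17'),
    (62, 'Details1PageLoad', NULL,           304, '2018-09-05 11:20:20');
```

Note that rows 9 and 10 share the same timestamp, so ord_num is the safer column to order by when the exact event order matters.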
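Goal: for each user_id, count the failed submissions per ItemCode. It is important not to mix up the item codes from failed submissions with those from successful ones. There may also be multiple "failed" entries for the same item code.
I am no expert in Redshift, least of all in its window functions, but the first idea I tried to build on was the LAG function. To do that, I planned to identify the sequences of ord_nums that qualify, for example: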
ord_num  event_type        event_value   user_id  event_datetime       error?  sequence
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32  1       8-9-10
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34  1       51-52-53
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58  1       54-55-56
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22  1       54-57-58
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
The resulting counts by user_id would therefore be:
user_id  nr_failed_submissions
124      1
228      1
304      2
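However, as the dataset and the expected result above show, there is no way to predict how many records to look back, and I need an additional condition that cannot be placed inside LAG...
I tried quite a few options, but none of them fit.
I also found some very useful and insightful posts, but so far I have not managed to combine them into a workable solution. Surely there must be a way to do this in Redshift?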
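Answer 0 (score: 1)
This query creates "time ranges", where time1 is the timestamp of a PageLoad event for a user and time2 is the timestamp of that user's next PageLoad event (foo stands in for the event_user_item table):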
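WITH timeranges AS
(
  SELECT A.user_id,
         A.event_datetime AS time1,
         -- MIN (not MAX) picks the next PageLoad, so ranges do not overlap
         -- when a user has more than two PageLoads; the user's last
         -- PageLoad gets a far-future sentinel instead
         nvl(MIN(B.event_datetime),'2099-01-01') AS time2
  FROM foo AS A
    LEFT JOIN foo AS B
           ON A.user_id = B.user_id
          AND A.event_datetime < B.event_datetime
          AND A.event_type = B.event_type
  WHERE A.event_type = 'PageLoad'
  GROUP BY A.user_id,
           A.event_datetime
)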
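The same ranges can probably be built without the self-join by using the LEAD window function. The following is a minimal sketch of that swapped-in technique, not the query from this answer: it keeps the table name foo and the far-future sentinel from above, and it assumes a user never has two PageLoad events with an identical timestamp (the GROUP BY version collapses such duplicates, this one does not).

```sql
-- Sketch: next PageLoad per user via LEAD instead of a self-join.
WITH timeranges AS
(
  SELECT user_id,
         event_datetime AS time1,
         -- LEAD returns NULL for the user's last PageLoad,
         -- so fall back to the same far-future sentinel
         NVL(LEAD(event_datetime) OVER (PARTITION BY user_id
                                        ORDER BY event_datetime),
             CAST('2099-01-01' AS TIMESTAMP)) AS time2
  FROM foo
  WHERE event_type = 'PageLoad'
)
SELECT user_id, time1, time2
FROM timeranges
ORDER BY user_id, time1;
```

Because the WHERE clause is applied before the window function, LEAD only ever sees PageLoad rows, which is what makes time2 the next PageLoad rather than the next event of any type.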
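Building on that, this query associates each 'ItemCode' event with the timestamp of its corresponding 'PageLoad':
SELECT timeranges.time1 AS pageloadtime,
       foo.*
FROM foo
  LEFT JOIN timeranges
         ON foo.user_id = timeranges.user_id   -- only match the same user's ranges
        AND foo.event_datetime >= timeranges.time1
        AND foo.event_datetime < timeranges.time2
WHERE foo.event_type = 'ItemCode'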
This query determines whether any 'ErrorResponse' event falls within each of these ranges:
SELECT timeranges.time1 AS pageloadtime,
       timeranges.user_id,
       BOOL_OR(foo.event_type = 'ErrorResponse') AS has_error
FROM timeranges
  LEFT JOIN foo
         ON foo.user_id = timeranges.user_id   -- ignore other users' errors
        AND event_datetime > time1
        AND event_datetime < time2
GROUP BY timeranges.time1,
         timeranges.user_id
HAVING has_error;
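That should cover everything we need: for each PageLoad event we know (1) whether that page load had an error, and (2) all the ItemCode events associated with that page load. Joining these two result sets should give us the information we want.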
Redshift's quirks gave me some trouble when I tried to join these two datasets directly, so I had to create two temporary tables. This query gave me the expected results:
create temporary table items_per_pageload as
with timeranges as (
  select A.user_id,
         A.event_datetime as time1,
         nvl(min(B.event_datetime), '2099-01-01') as time2  -- min = next PageLoad
  from event_user_item as A
  left join event_user_item as B
    on A.user_id = B.user_id
   and A.event_datetime < B.event_datetime
   and A.event_type = B.event_type
  where A.event_type = 'PageLoad'
  group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime, event_user_item.*
from event_user_item
left join timeranges
  on event_user_item.user_id = timeranges.user_id
 and event_user_item.event_datetime >= timeranges.time1
 and event_user_item.event_datetime < timeranges.time2
where event_user_item.event_type = 'ItemCode';

create temporary table pageloads_with_errors as
with timeranges as (
  select A.user_id,
         A.event_datetime as time1,
         nvl(min(B.event_datetime), '2099-01-01') as time2  -- min = next PageLoad
  from event_user_item as A
  left join event_user_item as B
    on A.user_id = B.user_id
   and A.event_datetime < B.event_datetime
   and A.event_type = B.event_type
  where A.event_type = 'PageLoad'
  group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime,
       timeranges.user_id,
       bool_or(event_user_item.event_type = 'ErrorResponse') as has_error
from timeranges
left join event_user_item
  on event_user_item.user_id = timeranges.user_id
 and event_datetime > time1
 and event_datetime < time2
group by timeranges.time1, timeranges.user_id
having has_error;

-- count the ItemCode events that belong to a page load with an error
select count(1), user_id, event_value
from (
  select items_per_pageload.*
  from items_per_pageload
  join pageloads_with_errors
    on items_per_pageload.user_id = pageloads_with_errors.user_id
   and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime
)
group by user_id, event_value;
Answer 1 (score: 0)
The following approach and queries, based on Jason Rosendale's answer, worked for me as they should:
create temporary table items_per_pageload as
with timeranges as (
select A.user_id
,A.event_datetime as time1
,nvl(max(B.event_datetime), '2099-01-01') as time2
,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time
from event_user_item as A
left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
where A.event_type='PageLoad'
group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime, event_user_item.*
from event_user_item left join timeranges
  on event_user_item.user_id=timeranges.user_id /* only match the same user's ranges */
 and event_user_item.event_datetime>=timeranges.time1
 and event_user_item.event_datetime<nvl(timeranges.next_load_time,timeranges.time2)
where event_user_item.event_type='ItemCode';
create temporary table pageloads_with_errors as
with timeranges as (
select A.user_id
,A.event_datetime as time1
,nvl(max(B.event_datetime), '2099-01-01') as time2
,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time
from event_user_item as A left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
where A.event_type='PageLoad'
group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime,timeranges.user_id,bool_or(event_user_item.event_type='ErrorResponse') as has_error
from timeranges
left join event_user_item
  on event_user_item.user_id = timeranges.user_id /* ignore other users' errors */
 and event_datetime > time1 and event_datetime < nvl(next_load_time,time2)
group by timeranges.time1,timeranges.user_id
having has_error;
/* final counts */
select count(1), user_id, event_value from (
select items_per_pageload.*
from items_per_pageload
join pageloads_with_errors on items_per_pageload.user_id = pageloads_with_errors.user_id and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime
)
group by user_id, event_value;
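As an alternative to the time-range joins, the item code behind each error can possibly be picked up directly with LAG and its IGNORE NULLS option, which addresses the "how far to look back" problem from the question. This is a hedged sketch of a different technique, not the method used in the answers above. It assumes that ord_num reflects the true event order within a user, that every ErrorResponse belongs to the most recent ItemCode of the same user, and that item codes are never NULL; it does not check that a PageLoad came first.

```sql
-- Sketch: attach the most recent ItemCode value to every row,
-- then keep only the ErrorResponse rows and count them per item code.
SELECT user_id,
       failed_item_code,
       COUNT(*) AS nr_failed_submissions
FROM (
  SELECT user_id,
         event_type,
         -- the CASE is NULL for every non-ItemCode row, so IGNORE NULLS
         -- makes LAG skip back to the last preceding ItemCode
         LAG(CASE WHEN event_type = 'ItemCode' THEN event_value END)
           IGNORE NULLS
           OVER (PARTITION BY user_id ORDER BY ord_num) AS failed_item_code
  FROM event_user_item
) AS e
WHERE event_type = 'ErrorResponse'
  AND failed_item_code IS NOT NULL
GROUP BY user_id, failed_item_code;
```

On the excerpt from the question this should yield one row each for (124, GU07019), (228, EQ23487), (304, OB68102) and (304, PB68102), which adds up to the expected 1, 1 and 2 failed submissions per user. Item codes followed by a successful submission, such as LG56731, are never counted because no ErrorResponse row points back at them.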