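Background
I have ETL jobs that run hourly to process real-time log files. Whenever the system generates a new event, it takes a snapshot of the summaries of all historical events (if any exist) and logs it together with the current event. The data is then loaded into Redshift.
Example
The table looks like this: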
+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
| 2 | time2 | 1 | time1 | 13 | 5 |
| 3 | time3 | 1 | time1 | 13 | 5 |
| 3 | time3 | 2 | time2 | 2 | 1 |
| 4 | time4 | 1 | time1 | 13 | 5 |
| 4 | time4 | 2 | time2 | 2 | 1 |
| 4 | time4 | 3 | time3 | 1 | 1 |
+------------+--------------+---------+-----------+-------+-------+
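Here is what happened in the table above: each time an event occurs, one row is logged for every earlier event in its snapshot. Event 1 has no history, so it never appears as current_id.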
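To experiment with the queries in this thread, the example can be loaded into a scratch table. This is only a sketch: the table name t is borrowed from the answers below, and the column types and the placeholder strings 'time1' to 'time4' are assumptions, since the real table uses UUIDs and timestamps.

```sql
-- Hypothetical sample table mirroring the example above; types are assumptions.
create temp table t (
    current_id     integer,
    "current_time" varchar(16),  -- quoted because CURRENT_TIME is a reserved word
    past_id        integer,
    past_time      varchar(16),
    freq1          integer,
    freq2          integer
);

-- One row per (current event, past event) pair, exactly as in the table above.
insert into t values
    (2, 'time2', 1, 'time1', 13, 5),
    (3, 'time3', 1, 'time1', 13, 5),
    (3, 'time3', 2, 'time2',  2, 1),
    (4, 'time4', 1, 'time1', 13, 5),
    (4, 'time4', 2, 'time2',  2, 1),
    (4, 'time4', 3, 'time3',  1, 1);
```

Because current_time collides with a reserved word, any query run against this sample table has to quote the column name or use a different name.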
Desired result
I need to transform the data into the following format for analysis:
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 0 | 0 |
| 2 | time2 | 13 | 5 | -- 13 | 5
| 3 | time3 | 15 | 6 | -- 13 + 2 | 5 + 1
| 4 | time4 | 16 | 7 | -- 15 + 1 | 6 + 1
+----+------------+-------+-------+
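Basically, the new freq1 and freq2 are the cumulative sums of the lagged freq1 and freq2.
My idea
I am thinking of a full outer self-join on current_id and past_id to first get the following result: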
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 13 | 5 |
| 2 | time2 | 2 | 1 |
| 3 | time3 | 1 | 1 |
| 4 | time4 | null | null |
+----+------------+-------+-------+
Then I can apply the window functions lag() over () followed by sum() over ().
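The idea above can be sketched roughly as follows. This is not the asker's actual query: it swaps the self-join for a distinct plus union all to build the intermediate table, and it folds lag() and sum() into a single window frame that ends one row before the current row. It assumes the table is named t, that freq1 and freq2 are integers, and that each past event carries the same freq values in every snapshot, as in the sample data.

```sql
-- Sketch only: table name t and integer freq columns are assumptions.
with per_event as (
    -- Each past event carries its own freq1/freq2, repeated in every snapshot.
    select distinct past_id as id, past_time as event_time, freq1, freq2
    from t
    union all
    -- The newest event never appears as past_id, so its values are unknown.
    select distinct current_id, "current_time",
           cast(null as integer), cast(null as integer)
    from t
    where current_id not in (select past_id from t)
)
select id, event_time,
       -- Summing all strictly earlier rows equals the cumulative sum of the lagged value.
       coalesce(sum(freq1) over (order by event_time
                                 rows between unbounded preceding and 1 preceding), 0) as freq1,
       coalesce(sum(freq2) over (order by event_time
                                 rows between unbounded preceding and 1 preceding), 0) as freq2
from per_event
order by event_time;
```

The coalesce covers the first event, whose frame is empty and would otherwise yield null instead of 0. "current_time" is quoted because CURRENT_TIME is a reserved word.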
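Question
Solution
The answer from @GordonLinoff is correct for the use case above. I am adding a few small updates to make it work on my actual table. The only difference is that my event_id is a 36-character Java UUID and event_time is a timestamp.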
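-- Seed row: the earliest event has no history, so its frequencies are 0.
-- "current_time" is quoted because CURRENT_TIME is a reserved word.
select past_id as id, past_time as event_time, 0 as freq1, 0 as freq2
from (
    select past_id, past_time,
           row_number() over (order by past_time) as seqnum
    from t
) a
where a.seqnum = 1
union all
-- One row per current event: take the most recent past event in each snapshot
-- (ordered by past_time, since UUIDs do not sort chronologically) and
-- accumulate its frequencies over time.
select current_id, "current_time",
       sum(freq1) over (order by "current_time" rows unbounded preceding) as freq1,
       sum(freq2) over (order by "current_time" rows unbounded preceding) as freq2
from (
    select current_id, "current_time", freq1, freq2,
           row_number() over (partition by current_id order by past_time desc) as seqnum
    from t
) b
where b.seqnum = 1;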
Answer 0 (score: 1)
I think you want union all and window functions. Here is an example:
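-- Seed row for the first event, which has no history.
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
-- Latest past event per snapshot (seqnum = 1), accumulated over time.
-- Redshift requires an explicit frame when an aggregate window has ORDER BY.
(select current_id, "current_time",
        sum(freq1) over (order by "current_time" rows unbounded preceding),
        sum(freq2) over (order by "current_time" rows unbounded preceding)
 from (select current_id, "current_time", freq1, freq2,
              row_number() over (partition by current_id order by past_id desc) as seqnum
       from t
      ) t
 where seqnum = 1
);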
Answer 1 (score: 0)
Given how your data is laid out in the snapshot table, I think the following SQL should give you the desired result you posted:
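-- Seed row for the first event, which never appears as current_id.
SELECT 1 AS id
      ,'time1' AS event_time
      ,0 AS freq1
      ,0 AS freq2
UNION
-- Each snapshot lists every past event, so summing per current event
-- yields the cumulative totals.
SELECT T.current_id AS id
      ,T."current_time" AS event_time
      ,SUM(T.freq1) AS freq1
      ,SUM(T.freq2) AS freq2
  FROM snapshot AS T
 GROUP
    BY T.current_id
      ,T."current_time"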
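The first SELECT in the UNION above is there to produce the first record, for time1, because your base table never contains a row with that event as the current event. It has no FROM clause because it only selects constants; if Redshift does not support that, you may need to look for something equivalent to Oracle's DUAL table.
Hope this helps.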