Question

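Background

I have some ETL jobs that process real-time log files every hour. Whenever the system generates a new event, it takes a snapshot of the summaries of all historical events (if any) and records it together with the current event. The data is then loaded into Redshift.

Example

The table looks like this: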

+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
|          2 |        time2 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       1 |     time1 |    13 |     5 |
|          4 |        time4 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       3 |     time3 |     1 |     1 |
+------------+--------------+---------+-----------+-------+-------+

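Here is what happened in the table above:

time1: event 1 occurred. The system took a snapshot, but there was nothing to record.
time2: event 2 occurred. The system took a snapshot and recorded event 1.
time3: event 3 occurred. The system took a snapshot and recorded events 1 and 2.
time4: event 4 occurred. The system took a snapshot and recorded events 1, 2 and 3.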

Desired result

I need to transform the data into the following format for analysis:

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |     0 |     0 |
|  2 |      time2 |    13 |     5 |  --     13 |     5
|  3 |      time3 |    15 |     6 |  -- 13 + 2 | 5 + 1
|  4 |      time4 |    16 |     7 |  -- 15 + 1 | 6 + 1
+----+------------+-------+-------+

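Basically, the new freq1 and freq2 are the cumulative sums of the lagged freq1 and freq2.

My idea

I am thinking of doing a full outer self-join on current_id and past_id to first get the following result: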

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |    13 |     5 |
|  2 |      time2 |     2 |     1 |
|  3 |      time3 |     1 |     1 |
|  4 |      time4 |  null |  null |
+----+------------+-------+-------+

Then I can apply window functions: lag() over (...) followed by sum() over (...).
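
A minimal sketch of that last step, assuming the intermediate result above is already available. Here it is hard-coded as a hypothetical CTE named per_event, which is not part of my real schema. Instead of a separate lag() and sum(), the frame "rows between unbounded preceding and 1 preceding" sums only the earlier rows, which should be equivalent to a cumulative sum of the lagged values. The coalesce turns the empty frame of the first event into 0.

```sql
-- per_event is a hypothetical stand-in for the self-join result shown above.
with per_event as (
    select 1 as id, 'time1' as event_time, 13 as freq1, 5 as freq2
    union all select 2, 'time2', 2, 1
    union all select 3, 'time3', 1, 1
    union all select 4, 'time4', null, null
)
select id,
       event_time,
       -- Sum every earlier row, excluding the current one (lag + running sum in one step).
       coalesce(sum(freq1) over (order by event_time
                                 rows between unbounded preceding and 1 preceding), 0) as freq1,
       coalesce(sum(freq2) over (order by event_time
                                 rows between unbounded preceding and 1 preceding), 0) as freq2
from per_event
order by event_time;
```

On the sample values this should reproduce the desired table: (0, 0), (13, 5), (15, 6), (16, 7).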

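Questions

Is this the right approach? Is there a more efficient way to do it? This is only a small part of the actual data, so performance may be an issue.
My query always returns a lot of duplicate values, so I am not sure what is going wrong.

Solution

The answer from @GordonLinoff is correct for the use case above. I am adding a few small updates to make it work on my actual table. The only differences are that my event_id is a 36-character Java UUID and event_time is a timestamp.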

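-- First event: it never appears as current_id, so take the earliest past event
-- of each snapshot (always the same event) and collapse the copies with distinct.
select distinct past_id, past_time, 0 as freq1, 0 as freq2
from (
    select past_id, past_time,
           row_number() over (partition by current_id order by past_time) as seqnum
    from t
) a
where a.seqnum = 1
union all
-- Later events: keep only the most recent past row per snapshot, then take a running sum.
select current_id, current_time,
       sum(freq1) over (order by current_time rows unbounded preceding) as freq1,
       sum(freq2) over (order by current_time rows unbounded preceding) as freq2
from (
    select current_id, current_time, freq1, freq2,
           -- Order by past_time, not past_id: UUIDs do not sort chronologically.
           row_number() over (partition by current_id order by past_time desc) as seqnum
    from t
) b
where b.seqnum = 1;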

Answer 1

I think you want union all and window functions. Here is an example:

select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
(select current_id, current_time,
        sum(freq1) over (order by current_time),
        sum(freq2) over (order by current_time)
 from (select current_id, current_time, freq1, freq2,
              row_number() over (partition by current_id order by past_id desc) as seqnum
       from t
      ) t
  where seqnum = 1
);
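
To check the query above against the sample data from the question, here is a self-contained sketch. Several things in it are assumptions rather than part of the original: the table t is hard-coded as a CTE, current_time is renamed to the hypothetical cur_time (CURRENT_TIME is a reserved keyword in standard SQL and PostgreSQL, so the unquoted name may be rejected), and an explicit frame clause is added because, as far as I know, Redshift requires one for aggregate window functions that use ORDER BY.

```sql
-- Sample snapshot rows from the question; cur_time is a hypothetical rename of current_time.
with t as (
    select 2 as current_id, 'time2' as cur_time, 1 as past_id, 'time1' as past_time,
           13 as freq1, 5 as freq2
    union all select 3, 'time3', 1, 'time1', 13, 5
    union all select 3, 'time3', 2, 'time2', 2, 1
    union all select 4, 'time4', 1, 'time1', 13, 5
    union all select 4, 'time4', 2, 'time2', 2, 1
    union all select 4, 'time4', 3, 'time3', 1, 1
)
-- The first event never appears as current_id, so it is emitted with zero counts.
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
-- Keep the newest past row of each snapshot, then accumulate over time.
select current_id, cur_time,
       sum(freq1) over (order by cur_time rows unbounded preceding),
       sum(freq2) over (order by cur_time rows unbounded preceding)
from (select current_id, cur_time, freq1, freq2,
             row_number() over (partition by current_id order by past_id desc) as seqnum
      from t
     ) latest
where seqnum = 1
order by 1;
```

With integer ids this should return the four rows of the desired result. With non-sequential ids such as UUIDs, the min(past_id) and "order by past_id desc" parts would presumably need to be driven by past_time instead.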

Answer 2

Given the way your data is laid out in the snapshot table, I think the following SQL should give you the desired result you posted:

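-- Hard-coded row for the first event, which has no snapshot of its own.
SELECT 1 AS id
      ,'time1' AS event_time
      ,0 AS freq1
      ,0 AS freq2
 UNION
-- Each snapshot lists every earlier event, so summing per current event
-- gives the cumulative totals directly.
SELECT T.current_id AS id
      ,T.current_time AS event_time
      ,SUM(T.freq1) AS freq1
      ,SUM(T.freq2) AS freq2
  FROM snapshot AS T
 GROUP
    BY T.current_id
      ,T.current_time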

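The first SELECT in the UNION above is there to produce the first record, for time1, because your base table has no snapshot entry for that event. It has no FROM clause because it only selects constants; if Redshift does not support that, you may need to look for an equivalent of Oracle's DUAL table.

Hope this helps.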
