Full outer join on the table itself and run some window functions

Time: 2016-01-27 22:21:50

Tags: sql amazon-redshift

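Background

I have some hourly ETL jobs that process real-time log files. Whenever the system generates a new event, it takes a snapshot of the summaries of all historical events (if any exist) and logs it together with the current event. The data is then loaded into Redshift.

Example

The table looks like this: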

+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
|          2 |        time2 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       1 |     time1 |    13 |     5 |
|          4 |        time4 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       3 |     time3 |     1 |     1 |
+------------+--------------+---------+-----------+-------+-------+

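Here is what happened in the table above:

  1. time1: event 1 happened. The system took a snapshot, but there was nothing to log.
  2. time2: event 2 happened. The system took a snapshot and logged event 1.
  3. time3: event 3 happened. The system took a snapshot and logged events 1 and 2.
  4. time4: event 4 happened. The system took a snapshot and logged events 1, 2, and 3.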
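The logging behavior in the list above can be sketched in Python. This is only a model of the description, not the actual ETL job: `events` and `build_snapshot_rows` are hypothetical names, and event 4's own frequencies are left as `None` because the question never shows them.

```python
# Per-event summaries read off the sample table: (id, time, freq1, freq2).
# Event 4's own frequencies do not appear in the question, so they stay None.
events = [
    (1, "time1", 13, 5),
    (2, "time2", 2, 1),
    (3, "time3", 1, 1),
    (4, "time4", None, None),
]


def build_snapshot_rows(events):
    """Log one row per (current event, earlier event) pair."""
    rows = []
    for i, (cur_id, cur_time, _, _) in enumerate(events):
        # The snapshot contains every event that happened before this one.
        for past_id, past_time, f1, f2 in events[:i]:
            rows.append((cur_id, cur_time, past_id, past_time, f1, f2))
    return rows


snapshot = build_snapshot_rows(events)
for row in snapshot:
    print(row)
```

Event 1 produces no rows because nothing precedes it, which is why it never shows up as a current_id in the table.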
Expected result

I need to transform the data into the following format for analysis:

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |     0 |     0 |
|  2 |      time2 |    13 |     5 |  --     13 |     5
|  3 |      time3 |    15 |     6 |  -- 13 + 2 | 5 + 1
|  4 |      time4 |    16 |     7 |  -- 15 + 1 | 6 + 1
+----+------------+-------+-------+

Basically, the new freq1 and freq2 are the cumulative sums of the lagged freq1 and freq2.

My thoughts

I am thinking of a self full outer join on current_id and past_id, to get the following result first:

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |    13 |     5 |
|  2 |      time2 |     2 |     1 |
|  3 |      time3 |     1 |     1 |
|  4 |      time4 |  null |  null |
+----+------------+-------+-------+

Then I can apply the window functions: lag() over () first, then sum() over ().
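The plan above (one row per event, then lag, then a running sum) can be checked against the sample data with a small Python model. This is a sketch of the logic only, not a Redshift query; the variable names are made up for illustration, and ordering by the time column assumes the timestamps sort chronologically.

```python
from itertools import accumulate

# Sample rows: (current_id, current_time, past_id, past_time, freq1, freq2)
snapshot = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]

# Step 1: one row per event. A past event repeats in every later snapshot,
# so it is de-duplicated by id; an event seen only as "current" gets None,
# like the null row a full outer join would produce.
per_event = {}
for cur_id, cur_time, past_id, past_time, f1, f2 in snapshot:
    per_event[past_id] = (past_time, f1, f2)
    per_event.setdefault(cur_id, (cur_time, None, None))

# Order by event time rather than id, since real ids may be UUIDs.
rows = [(eid, *vals) for eid, vals in sorted(per_event.items(), key=lambda kv: kv[1][0])]

# Step 2: lag by one row, using 0 for the first event.
lag1 = [0] + [r[2] for r in rows[:-1]]
lag2 = [0] + [r[3] for r in rows[:-1]]

# Step 3: running sum of the lagged values.
result = [
    (r[0], r[1], c1, c2)
    for r, c1, c2 in zip(rows, accumulate(lag1), accumulate(lag2))
]
for row in result:
    print(row)
```

Skipping the de-duplication in step 1 is one plausible source of repeated values in a self join, because event 1 alone appears three times in the sample.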

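Questions

1. Is this the right approach? Is there a more efficient way to do it? This is only a small sample of the actual data, so performance may be an issue.
2. My query always returns a lot of duplicate values, so I am not sure what went wrong.

Solution

The answer from @GordonLinoff is correct for the use case above. I am adding some small updates to make it work on my actual table. The only differences are that my event_id is a 36-character Java UUID and event_time is a timestamp.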

-- Zero row for the earliest event, which never appears as a current_id.
select past_id, past_time, 0 as freq1, 0 as freq2
from (
    select past_id, past_time,
           row_number() over (order by past_time) as seqnum
    from t
) a
where a.seqnum = 1
union all
-- Keep the most recent past event per current_id, then accumulate.
-- Ordering by past_time because UUIDs do not sort chronologically.
select current_id, current_time,
       sum(freq1) over (order by current_time rows unbounded preceding) as freq1,
       sum(freq2) over (order by current_time rows unbounded preceding) as freq2
from (
    select current_id, current_time, freq1, freq2,
           row_number() over (partition by current_id order by past_time desc) as seqnum
    from t
) b
where b.seqnum = 1;

2 Answers:

Answer 0 (score: 1)

I think you want union all and window functions. Here is an example:

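-- Zero row for the first event, which never appears as a current_id.
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
-- Keep the latest past event per current_id, then take running sums.
-- Redshift needs an explicit frame when an aggregate window has ORDER BY.
(select current_id, current_time,
        sum(freq1) over (order by current_time rows unbounded preceding),
        sum(freq2) over (order by current_time rows unbounded preceding)
 from (select current_id, current_time, freq1, freq2,
              row_number() over (partition by current_id order by past_id desc) as seqnum
       from t
      ) t
 where seqnum = 1
);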
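To see why this produces the expected table, the two branches of the union can be traced in Python on the sample rows. This is a sketch of the query's logic under the assumption that ids increase over time; `latest`, `ordered`, and `running` are illustrative names, not part of the answer.

```python
from itertools import accumulate

# Sample rows: (current_id, current_time, past_id, past_time, freq1, freq2)
snapshot = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]

# Inner subquery with seqnum = 1: per current_id, keep the row whose
# past_id is largest, i.e. the event immediately before the current one.
latest = {}
for cur_id, cur_time, past_id, _, f1, f2 in snapshot:
    if cur_id not in latest or past_id > latest[cur_id][1]:
        latest[cur_id] = (cur_time, past_id, f1, f2)

# Window sums ordered by current_time.
ordered = sorted(latest.items(), key=lambda kv: kv[1][0])
run1 = accumulate(v[2] for _, v in ordered)
run2 = accumulate(v[3] for _, v in ordered)
running = [(cid, v[0], a, b) for (cid, v), a, b in zip(ordered, run1, run2)]

# First branch of the union: min(past_id), min(past_time), 0, 0.
first = (min(r[2] for r in snapshot), min(r[3] for r in snapshot), 0, 0)

result = [first] + running
for row in result:
    print(row)
```

Each kept row carries the frequencies of the previous event, so the running sum is already "lagged" without calling lag() at all.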

Answer 1 (score: 0)

Given the way your data is laid out in the snapshot table, I think the following SQL should give you the expected result you posted:

SELECT 1 AS id
      ,'time1' AS event_time
      ,0 AS freq1
      ,0 AS freq2
 UNION
SELECT T.current_id AS id
      ,T.current_time AS event_time
      ,SUM(T.freq1) AS freq1
      ,SUM(T.freq2) AS freq2
  FROM snapshot AS T
 GROUP BY T.current_id
      ,T.current_time

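The first SELECT in the UNION above is there to get the first record for time1, because your base table has no entry for it among the snapshots. It has no FROM because we are only selecting constants; if Redshift does not support that, you may need to look for something equivalent to the DUAL table in Oracle.

Hope this helps.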
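The GROUP BY approach relies on each snapshot containing every earlier event, so a plain sum per current event already equals the cumulative total. A short Python model on the sample rows illustrates this; it is a sketch of the logic only, and the hard-coded first row mirrors the constant SELECT in the query.

```python
from collections import defaultdict

# Sample rows: (current_id, current_time, past_id, past_time, freq1, freq2)
snapshot = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]

# GROUP BY current_id, current_time with SUM(freq1), SUM(freq2).
totals = defaultdict(lambda: [0, 0])
for cur_id, cur_time, _, _, f1, f2 in snapshot:
    totals[(cur_id, cur_time)][0] += f1
    totals[(cur_id, cur_time)][1] += f2

grouped = [(cid, ctime, s1, s2) for (cid, ctime), (s1, s2) in sorted(totals.items())]

# Constant first row, standing in for the FROM-less SELECT.
result = [(1, "time1", 0, 0)] + grouped
for row in result:
    print(row)
```

If a snapshot ever omitted an older event, the grouped sums would fall short of the running totals, so this shortcut depends on that property of the data.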