Can the window function LAG reference the column whose value is being calculated?

Time: 2015-12-17 16:01:46

Tags: postgresql gaps-and-islands

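I need to calculate the value of some column X based on some other columns of the current record and on the value of X for the previous record (using some partition and order). Basically, I need to implement a query of the form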
SELECT <some fields>, 
  <some expression using LAG(X) OVER (PARTITION BY ... ORDER BY ...)> AS X
FROM <table>

This is not possible, because only existing columns can be used in a window function, so I am looking for a way to work around it.

Here is an example. I have a table of events. Each event has a type and a time_stamp:

create table event (id serial, type integer, time_stamp integer);

I want to find "duplicate" events. By duplicate I mean the following. Order all events of a given type by time_stamp ascending. Then

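  1. The first event is not a duplicate.
  2. All events that follow a non-duplicate and fall within some time frame after it (that is, their time_stamp is not greater than the time_stamp of the previous non-duplicate plus some constant TIMEFRAME) are duplicates.
  3. The next event whose time_stamp exceeds that of the previous non-duplicate by more than TIMEFRAME is not a duplicate.

For this data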

    insert into event (type, time_stamp) 
     values 
      (1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10), 
      (1,15), (1, 21), (2,13), 
      (1, 40);
    

and TIMEFRAME = 10, the result should be

    time_stamp | type | duplicate
    -----------------------------
            1  |    1 | false
            2  |    1 | true     
            3  |    1 | true 
           10  |    1 | true 
           15  |    1 | false 
           21  |    1 | true
           40  |    1 | false
            2  |    2 | false
           10  |    2 | true
           13  |    2 | false
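
The rule can also be written down procedurally. The following is only a minimal sketch under stated assumptions: the function name mark_duplicates and its output column names are made up for illustration and are not part of the question, and a gap of TIMEFRAME or more is treated as starting a new non-duplicate.

```sql
-- Hypothetical helper, not from the question: walks the events of each type
-- in time_stamp order and remembers the last non-duplicate ("leader").
create or replace function mark_duplicates(timeframe integer)
returns table (ev_time_stamp integer, ev_type integer, duplicate boolean)
language plpgsql as $$
declare
    r record;
    cur_type integer;   -- type of the partition currently being walked
    leader integer;     -- time_stamp of the last non-duplicate event
begin
    for r in
        select e.type, e.time_stamp
        from event e
        order by e.type, e.time_stamp
    loop
        -- a new type, or a gap of at least timeframe, starts a new leader
        if cur_type is distinct from r.type
           or r.time_stamp - leader >= timeframe then
            cur_type := r.type;
            leader := r.time_stamp;
            duplicate := false;
        else
            duplicate := true;
        end if;
        ev_time_stamp := r.time_stamp;
        ev_type := r.type;
        return next;
    end loop;
end $$;

select * from mark_duplicates(10);
```

The loop carries state (the current leader) from one row to the next, which is exactly what a plain LAG over existing columns cannot express.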
    

I could calculate the value of the duplicate field from the current time_stamp and the time_stamp of the previous non-duplicate event, like this:

    WITH evt AS (
      SELECT 
        time_stamp, 
        CASE WHEN 
          time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
        THEN 
          time_stamp
        ELSE
          LAG(current_non_dupl_time_stamp) OVER w
        END AS current_non_dupl_time_stamp
      FROM event
      WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
    )
    SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
    FROM evt;
    

But this does not work, because the calculated field cannot be referenced in LAG:

    ERROR:  column "current_non_dupl_time_stamp" does not exist.
    

So the question is: can I rewrite this query to get the effect I need?

3 answers:

Answer 0 (score: 2)

A naive recursive chain knitter:

        -- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
        SELECT e.type,e.time_stamp
        , ROW_NUMBER() OVER www as rn                   -- number the records
        , FIRST_VALUE(e.time_stamp) OVER www as fst     -- the "group leader"
        , EXISTS (SELECT * FROM event x
                WHERE x.type = e.type
                AND x.time_stamp < e.time_stamp) AS is_dup
        FROM event e
        WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
        ;

WITH RECURSIVE ttt AS (
        SELECT d0.*
        FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
    UNION ALL
        SELECT d1.type, d1.time_stamp, d1.rn
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
                 ELSE ttt.fst END AS fst   -- new "group leader"
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
                 ELSE True END AS is_dup
        FROM drag d1
        JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
        )
SELECT * FROM ttt
ORDER BY type, time_stamp
        ;

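Result (note that the query above hard-codes a threshold of 20 rather than the question's TIMEFRAME of 10, which is why time_stamp 15 of type 1 and time_stamp 13 of type 2 are flagged as duplicates here):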

CREATE TABLE
INSERT 0 10
CREATE VIEW
 type | time_stamp | rn | fst | is_dup 
------+------------+----+-----+--------
    1 |          1 |  1 |   1 | f
    1 |          2 |  2 |   1 | t
    1 |          3 |  3 |   1 | t
    1 |         10 |  4 |   1 | t
    1 |         15 |  5 |   1 | t
    1 |         21 |  6 |   1 | t
    1 |         40 |  7 |  40 | f
    2 |          2 |  1 |   2 | f
    2 |         10 |  2 |   2 | t
    2 |         13 |  3 |   2 | t
(10 rows)
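
For comparison, here is a sketch of the same chain-walking idea using the question's TIMEFRAME of 10 and the >= comparison from the question's own query attempt. It is written as one statement with an ordinary CTE in place of the temp view; the CTE names numbered and chain and the column name leader are made up for this illustration.

```sql
WITH RECURSIVE numbered AS (
    -- number the events within each type, in time_stamp order
    SELECT type, time_stamp,
           ROW_NUMBER() OVER (PARTITION BY type ORDER BY time_stamp) AS rn
    FROM event
), chain AS (
    -- the first event of each type is never a duplicate and becomes the leader
    SELECT type, time_stamp, rn, time_stamp AS leader, false AS duplicate
    FROM numbered
    WHERE rn = 1
  UNION ALL
    -- step to the next row, carrying the current leader forward
    SELECT n.type, n.time_stamp, n.rn,
           CASE WHEN n.time_stamp - c.leader >= 10 THEN n.time_stamp
                ELSE c.leader END,
           n.time_stamp - c.leader < 10
    FROM numbered n
    JOIN chain c ON n.type = c.type AND n.rn = c.rn + 1
)
SELECT time_stamp, type, duplicate
FROM chain
ORDER BY type, time_stamp;
```

On the sample data this should mark 15 (type 1) and 13 (type 2) as non-duplicates, in line with the expected output in the question. Like the original, it advances one row per recursion step, so it is a sketch of the technique rather than a tuned solution.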

Answer 1 (score: 1)

This is more of a recursion problem than a window-function problem. The following query gets the desired result:

WITH RECURSIVE base(type, time_stamp) AS (

  -- 3. base of recursive query
  SELECT x.type, x.time_stamp, y.next_time_stamp
    FROM 
         -- 1. start with the initial records of each type   
         ( SELECT type, min(time_stamp) AS time_stamp
             FROM event
             GROUP BY type
         ) x
         LEFT JOIN LATERAL
         -- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
         ( SELECT MIN(time_stamp) next_time_stamp
             FROM event
             WHERE type = x.type
               AND time_stamp > (x.time_stamp + 10)
         ) y ON true

  UNION ALL

  -- 4. recursive join, same logic as base
  SELECT e.type, e.time_stamp, z.next_time_stamp
    FROM event e
    JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
    LEFT JOIN LATERAL
    ( SELECT MIN(time_stamp) next_time_stamp
       FROM event
       WHERE type = e.type
         AND time_stamp > (e.time_stamp + 10)
    ) z ON true

)

-- The actual query:

-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false AS duplicate
  FROM base

UNION

-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
  FROM event
  WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base) 

ORDER BY type, time_stamp

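This comes with a lot of caveats. It assumes there are no duplicate time_stamp values for a given type. Really, the joins should be based on a unique id rather than on type and time_stamp. I have not tested this much, but it may at least suggest an approach.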

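This is my first attempt at a LATERAL join, so there may be a way to simplify it further. What I really wanted was a recursive CTE whose recursive part used MIN(time_stamp) with a condition directly, but aggregate functions are not allowed in a CTE in that manner. It seems, though, that a lateral join can be used in the CTE.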

Answer 2 (score: 1)

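An alternative to the recursive approach is a custom aggregate. Once you have mastered the technique of writing your own aggregates, creating the transition and final functions is easy and logical.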

State transition function:

create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    if st is null or st[1] + timeframe <= time_stamp
    then 
        st[1] := time_stamp;
    end if;
    st[2] := time_stamp;
    return st;
end $$;

Final function:

create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
    select st[1] <> st[2];
$$;

Aggregate:

create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate,
    stype = int[],
    finalfunc = is_duplicate_final
);
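
The query that uses the aggregate is not reproduced in this text. As a sketch of how it might be invoked, assuming the event table from the question and a TIMEFRAME of 10, the aggregate can be called as a window function so that the state accumulates row by row within each type (the output alias duplicate is an assumption):

```sql
-- Running call of the custom aggregate over each type's events;
-- with ORDER BY in the window, the frame ends at the current row.
select time_stamp, type,
       is_duplicate_agg(time_stamp, 10) over w as duplicate
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
```

Because the window has an ORDER BY, each row sees the state built from all earlier rows of its partition, which is what lets the transition function remember the last non-duplicate time_stamp in st[1].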

Read more in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.