Get the first, second, third ... last value and select rows (window function with FILTER and LAG)

Time: 2018-02-13 15:45:14

Tags: sql postgresql window-functions postgresql-9.6

I would like to run a window function with a FILTER clause, for example:

LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1"

However, Postgres does not support this operation, and I cannot work out another way to do it. Details below.

Challenge

Input (tab_A):

+----+------+------+
| id | type | date |
+----+------+------+
|  1 | A    |   30 |
|  1 | A    |   25 |
|  1 | A    |   20 |
|  1 | B    |   29 |
|  1 | B    |   28 |
|  1 | B    |   21 |
|  1 | C    |   24 |
|  1 | C    |   22 |
+----+------+------+

Desired output:

+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
|  1 | A    |   30 |      25 |      20 |      29 |      28 |      24 |      22 |
|  1 | A    |   25 |      20 |         |         |         |      24 |      22 |
|  1 | A    |   20 |         |         |         |         |         |         |
|  1 | B    |   29 |      25 |      20 |      28 |      21 |      24 |      22 |
|  1 | B    |   28 |      25 |      20 |      21 |         |      24 |      22 |
|  1 | B    |   21 |      20 |         |         |         |      24 |      22 |
|  1 | C    |   24 |      20 |         |      21 |         |      22 |         |
|  1 | C    |   22 |      20 |         |      21 |         |         |         |
+----+------+------+---------+---------+---------+---------+---------+---------+

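In words:

  • For each row, select all rows that occurred before it (see the date column)
  • Then, for each type ('A', 'B', 'C'), put the most recent date into A_lag_1 and the second most recent (by date) into A_lag_2 for type 'A', into B_lag_1 and B_lag_2 for 'B', and so on.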

The example above is heavily simplified; in my real use case there will be many more id values, more lag column iterations (A_lag_X), and more types.

Possible solution

This challenge seems a perfect fit for a window function, since I want to keep the same number of rows as tab_A and append information that relates to each row but comes from the past.

So, building it with a window function (sqlfiddle):

SELECT
  id, type, "date",
  LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
  LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
  LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
  LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
  LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
  LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A

However, I get the following error:

ERROR: FILTER is not implemented for non-aggregate window functions
Position: 30

Although this error is mentioned in the documentation, I cannot work out another way to do it.

Any help is much appreciated.

Other SO questions:

  • This answer relies on an aggregate function such as max. However, that will not work when trying to retrieve the second row, third row, and so on.

3 answers:

Answer 0 (score: 0)

You could try something like the following.

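-- Note: this answer names the date column dateVAL.
-- The b.id = ... conditions keep each lookup within the same id,
-- matching PARTITION BY id in the question.
SELECT
    dt.*,
    -- second lag per type: the latest value strictly below the first lag
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'A' AND dt.A_lag_1 > b.dateVAL) AS "A_lag_2",
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'B' AND dt.B_lag_1 > b.dateVAL) AS "B_lag_2",
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'C' AND dt.C_lag_1 > b.dateVAL) AS "C_lag_2"
FROM (
    SELECT
        a.id, a.type, a.dateVAL,
        -- first lag per type: the latest value strictly before the current row
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'A' AND a.dateVAL > b.dateVAL) AS A_lag_1,
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'B' AND a.dateVAL > b.dateVAL) AS B_lag_1,
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'C' AND a.dateVAL > b.dateVAL) AS C_lag_1
    FROM tab_A a
) dt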

Here is the Fiddle link. This may not be the most efficient approach.
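
If more lag levels are needed, nesting one MAX subquery per level gets unwieldy. One possible alternative, offered as an untested sketch, is to pick the N-th most recent earlier value directly with ORDER BY ... OFFSET. It assumes the tab_A table and dateVAL column used in this answer, assumes dateVAL is unique within each id and type, and adds a match on id, which is an assumption about the intended grouping.

```sql
-- Sketch only: tab_A and dateVAL come from the answer above;
-- the b.id = a.id condition and the uniqueness of dateVAL are assumptions.
SELECT
    a.id, a.type, a.dateVAL,
    -- most recent earlier 'A' value: skip 0 rows
    (SELECT b.dateVAL
       FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b.dateVAL < a.dateVAL
      ORDER BY b.dateVAL DESC
      LIMIT 1 OFFSET 0) AS "A_lag_1",
    -- second most recent earlier 'A' value: skip 1 row
    (SELECT b.dateVAL
       FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b.dateVAL < a.dateVAL
      ORDER BY b.dateVAL DESC
      LIMIT 1 OFFSET 1) AS "A_lag_2",
    -- most recent earlier 'B' value
    (SELECT b.dateVAL
       FROM tab_A b
      WHERE b.id = a.id AND b.type = 'B' AND b.dateVAL < a.dateVAL
      ORDER BY b.dateVAL DESC
      LIMIT 1 OFFSET 0) AS "B_lag_1"
FROM tab_A a
ORDER BY a.id, a.type, a.dateVAL DESC;
```

The remaining columns follow the same pattern, with OFFSET set to the lag number minus one. A scalar subquery that returns no row yields NULL, so missing lags come out as NULL. An index on (id, type, dateVAL) might help each lookup.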

Answer 1 (score: 0)

Another possible solution, using a lateral join (fiddle):

SELECT
    a.id,
    a.type,
    a."date",
    c.nn_row,
    c.type,
    c."date" as "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
    SELECT
        type,
        "date",
        row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
    FROM tab_A AS b
    WHERE a."date" > b."date"
) AS c on true
WHERE c.nn_row <= 5

This produces a long table like:

+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
|  1 | A    |   30 |      1 | A    |          25 |
|  1 | A    |   30 |      2 | A    |          20 |
|  1 | A    |   30 |      1 | B    |          29 |
|  1 | A    |   30 |      2 | B    |          28 |
|  1 | A    |   30 |      3 | B    |          21 |
|  1 | A    |   30 |      1 | C    |          24 |
|  1 | A    |   30 |      2 | C    |          22 |
|  1 | A    |   25 |      1 | A    |          20 |
|  1 | A    |   25 |      1 | B    |          21 |
|  1 | A    |   25 |      1 | C    |          24 |
|  1 | A    |   25 |      2 | C    |          22 |
|  1 | B    |   29 |      1 | A    |          25 |
|  1 | B    |   29 |      2 | A    |          20 |
|  1 | B    |   29 |      1 | B    |          28 |
|  1 | B    |   29 |      2 | B    |          21 |
|  1 | B    |   29 |      1 | C    |          24 |
|  1 | B    |   29 |      2 | C    |          22 |
|  1 | B    |   28 |      1 | A    |          25 |
|  1 | B    |   28 |      2 | A    |          20 |
|  1 | B    |   28 |      1 | B    |          21 |
|  1 | B    |   28 |      1 | C    |          24 |
|  1 | B    |   28 |      2 | C    |          22 |
|  1 | B    |   21 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | B    |          21 |
|  1 | C    |   24 |      1 | C    |          22 |
|  1 | C    |   22 |      1 | A    |          20 |
|  1 | C    |   22 |      1 | B    |          21 |
+----+------+------+--------+------+-------------+

You can then pivot this into the desired output.
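
The pivot step could look roughly like the sketch below, which uses conditional aggregation with FILTER on an ordinary aggregate. The names long_form and type_joined are hypothetical, the match on id inside the lateral subquery is an assumption, and the grouping assumes that (id, type, "date") identifies a single row of tab_A.

```sql
-- Sketch only. long_form and type_joined are hypothetical names;
-- the b.id = a.id condition is an assumption about the intended grouping.
WITH long_form AS (
    SELECT
        a.id,
        a.type,
        a."date",
        c.nn_row,
        c.type   AS type_joined,
        c."date" AS date_joined
    FROM tab_A AS a
    LEFT JOIN LATERAL (
        SELECT
            b.type,
            b."date",
            row_number() OVER (PARTITION BY b.type ORDER BY b."date" DESC) AS nn_row
        FROM tab_A AS b
        WHERE b.id = a.id
          AND b."date" < a."date"
    ) AS c ON true
)
SELECT
    id,
    type,
    "date",
    -- each FILTER matches at most one joined row, so MAX just unwraps it
    MAX(date_joined) FILTER (WHERE type_joined = 'A' AND nn_row = 1) AS "A_lag_1",
    MAX(date_joined) FILTER (WHERE type_joined = 'A' AND nn_row = 2) AS "A_lag_2",
    MAX(date_joined) FILTER (WHERE type_joined = 'B' AND nn_row = 1) AS "B_lag_1",
    MAX(date_joined) FILTER (WHERE type_joined = 'B' AND nn_row = 2) AS "B_lag_2",
    MAX(date_joined) FILTER (WHERE type_joined = 'C' AND nn_row = 1) AS "C_lag_1",
    MAX(date_joined) FILTER (WHERE type_joined = 'C' AND nn_row = 2) AS "C_lag_2"
FROM long_form
GROUP BY id, type, "date"
ORDER BY id, type, "date" DESC;
```

Dropping the outer WHERE on nn_row keeps rows of tab_A that have no earlier rows, so they appear with NULL in every lag column. The sketch does nothing to shrink the intermediate join, so it is only likely to suit small tables.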

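This worked for me on a small sample, but on the full table Postgres ran out of disk space (even though I had 50 GB available):

ERROR: could not write to hash-join temporary file: No space left on device

I have posted this solution here because it may work for others with smaller tables.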

Answer 2 (score: 0)

Since the FILTER clause does work with aggregate functions, I decided to write my own:

----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 1 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
    sfunc = lag_agg_sfunc_1,
    stype = point,
    finalfunc = lag_agg_ffunc_1,
    initcond = '(0,0)'
);


----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 2 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
    sfunc = lag_agg_sfunc_2,
    stype = point,
    finalfunc = lag_agg_ffunc_2,
    initcond = '(0,0)'
);

You can use the aggregate functions lag_agg_1 and lag_agg_2 defined above with the window expression from the original question:

SELECT
  id, type, "date",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
FROM tab_A
ORDER BY id ASC, type, "date" DESC

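This runs fairly quickly compared with the other options. Some things that could be improved:

  • I could not work out how to handle NULLs properly, so I fudged the result at the end by converting every 0 to NULL. This will cause problems in some cases, for example when a genuine value is 0.
  • I simply copied and pasted the functions for each lag_X, because I could not work out how to parameterize them.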
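
One possible way around both points, offered as a sketch rather than a tested replacement, is to use the built-in array_agg aggregate with the same FILTER and frame, then subscript the resulting array. The window name w is a hypothetical choice. The sketch assumes "date" is unique within each id, and it relies on array_agg collecting the frame's rows in the window's ORDER BY order.

```sql
-- Sketch only: w is a hypothetical window name.
-- Assumes "date" is unique within each id, so the rows following the
-- current row in descending date order are exactly the earlier rows.
SELECT
    id, type, "date",
    -- element [N] of the filtered array is the N-th most recent earlier value
    (array_agg("date") FILTER (WHERE type = 'A') OVER w)[1] AS "A_lag_1",
    (array_agg("date") FILTER (WHERE type = 'A') OVER w)[2] AS "A_lag_2",
    (array_agg("date") FILTER (WHERE type = 'B') OVER w)[1] AS "B_lag_1",
    (array_agg("date") FILTER (WHERE type = 'B') OVER w)[2] AS "B_lag_2",
    (array_agg("date") FILTER (WHERE type = 'C') OVER w)[1] AS "C_lag_1",
    (array_agg("date") FILTER (WHERE type = 'C') OVER w)[2] AS "C_lag_2"
FROM tab_A
WINDOW w AS (
    PARTITION BY id
    ORDER BY "date" DESC
    ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
)
ORDER BY id ASC, type, "date" DESC;
```

The lag number becomes the array subscript, so no per-lag function is needed. A subscript past the end of the array, or an empty frame, gives NULL, so the NULLIF workaround should not be required.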

Any help with the above is much appreciated.