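I would like to execute a window function with a FILTER clause, for example:
LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1"
However, Postgres does not support this operation, and I cannot work out another way to do it. Details below.
Challenge
Input (tab_A):
+----+------+------+
| id | type | date |
+----+------+------+
| 1 | A | 30 |
| 1 | A | 25 |
| 1 | A | 20 |
| 1 | B | 29 |
| 1 | B | 28 |
| 1 | B | 21 |
| 1 | C | 24 |
| 1 | C | 22 |
+----+------+------+
Desired output:
+----+------+------+---------+---------+---------+---------+---------+---------+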
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
| 1 | A | 30 | 25 | 20 | 29 | 28 | 24 | 22 |
| 1 | A | 25 | 20 | | 21 | | 24 | 22 |
| 1 | A | 20 | | | | | | |
| 1 | B | 29 | 25 | 20 | 28 | 21 | 24 | 22 |
| 1 | B | 28 | 25 | 20 | 21 | | 24 | 22 |
| 1 | B | 21 | 20 | | | | | |
| 1 | C | 24 | 20 | | 21 | | 22 | |
| 1 | C | 22 | 20 | | 21 | | | |
+----+------+------+---------+---------+---------+---------+---------+---------+
In words:
For each row, select all rows that occurred before it (see the date column).
Then, for each type ('A', 'B', 'C'), put the most recent date in A_lag_1 and the second most recent (by date) value in A_lag_2 for type 'A', in B_lag_1 and B_lag_2 for 'B', and so on.
The above example is heavily simplified; in my real use case there will be many more id values, more lag column iterations (A_lag_X), and more types.
Possible solution
This challenge seems like a perfect fit for a window function, because I want to keep the same number of rows as tab_A and append information that is related to each row but lies in the past.
So I constructed it using a window function:
SELECT
    id, type, "date",
    LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
    LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
    LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
    LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
    LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
    LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A
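However, I get the following error:
ERROR: FILTER is not implemented for non-aggregate window functions Position: 30
Although this error is referenced in the sqlfiddle, I cannot work out another way of doing it.
Any help would be much appreciated.
Other SO questions: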
A similar question relies on aggregate functions such as max. However, that will not work when trying to retrieve the second, third, etc. row.
Answer 0 (score: 0)
You can try something like the following.
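-- Note: this query refers to the question's "date" column as dateVAL.
SELECT
    dt.*,
    -- Second lag: the latest date of each type strictly before the first lag.
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'A' AND b.dateVAL < dt."A_lag_1") AS "A_lag_2",
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'B' AND b.dateVAL < dt."B_lag_1") AS "B_lag_2",
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'C' AND b.dateVAL < dt."C_lag_1") AS "C_lag_2"
FROM
(
    SELECT
        a.id, a.type, a.dateVAL,
        -- First lag: the latest date of each type strictly before the current row.
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'A' AND b.dateVAL < a.dateVAL) AS "A_lag_1",
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'B' AND b.dateVAL < a.dateVAL) AS "B_lag_1",
        (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'C' AND b.dateVAL < a.dateVAL) AS "C_lag_1"
    FROM tab_A a
) dt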
Here is the Fiddle link. This may not be the most efficient approach.
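The nested MAX approach above needs one more level of nesting for every additional lag. As a rough sketch of an alternative (not part of the original answer), each lag can instead be fetched with an ordered correlated subquery and an OFFSET. The inlined sample data, the use of the question's "date" column name, and the correlation on id are assumptions made so that the query runs on its own.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH tab_A (id, type, "date") AS (
    VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
           (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
           (1, 'C', 24), (1, 'C', 22)
)
SELECT
    a.id, a.type, a."date",
    -- OFFSET 0 picks the most recent earlier row of the type, OFFSET 1 the one before it.
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 0) AS "A_lag_1",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 1) AS "A_lag_2",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'B' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 0) AS "B_lag_1",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'B' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 1) AS "B_lag_2",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'C' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 0) AS "C_lag_1",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'C' AND b."date" < a."date"
      ORDER BY b."date" DESC LIMIT 1 OFFSET 1) AS "C_lag_2"
FROM tab_A a
ORDER BY a.id, a.type, a."date" DESC;
```

A subquery that finds no row yields NULL, which matches the blank cells in the desired output. Unlike the MAX version, duplicate dates within a type would be counted as separate lags here, so which behavior is wanted depends on the real data.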
Answer 1 (score: 0)
Another possible solution uses a lateral join (fiddle):
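SELECT
    a.id,
    a.type,
    a."date",
    c.nn_row,
    c.type,
    c."date" AS "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
    -- Rank the earlier rows of the same id within each type, most recent first.
    SELECT
        b.type,
        b."date",
        row_number() OVER (PARTITION BY b.id, b.type ORDER BY b."date" DESC) AS nn_row
    FROM tab_A AS b
    WHERE b.id = a.id AND a."date" > b."date"
) AS c ON true
-- Filtering on c here also drops rows of a that have no earlier rows.
WHERE c.nn_row <= 5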
This produces a long table such as:
+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
| 1 | A | 30 | 1 | A | 25 |
| 1 | A | 30 | 2 | A | 20 |
| 1 | A | 30 | 1 | B | 29 |
| 1 | A | 30 | 2 | B | 28 |
| 1 | A | 30 | 3 | B | 21 |
| 1 | A | 30 | 1 | C | 24 |
| 1 | A | 30 | 2 | C | 22 |
| 1 | A | 25 | 1 | A | 20 |
| 1 | A | 25 | 1 | B | 21 |
| 1 | A | 25 | 1 | C | 24 |
| 1 | A | 25 | 2 | C | 22 |
| 1 | B | 29 | 1 | A | 25 |
| 1 | B | 29 | 2 | A | 20 |
| 1 | B | 29 | 1 | B | 28 |
| 1 | B | 29 | 2 | B | 21 |
| 1 | B | 29 | 1 | C | 24 |
| 1 | B | 29 | 2 | C | 22 |
| 1 | B | 28 | 1 | A | 25 |
| 1 | B | 28 | 2 | A | 20 |
| 1 | B | 28 | 1 | B | 21 |
| 1 | B | 28 | 1 | C | 24 |
| 1 | B | 28 | 2 | C | 22 |
| 1 | B | 21 | 1 | A | 20 |
| 1 | C | 24 | 1 | A | 20 |
| 1 | C | 24 | 1 | B | 21 |
| 1 | C | 24 | 1 | C | 22 |
| 1 | C | 22 | 1 | A | 20 |
| 1 | C | 22 | 1 | B | 21 |
+----+------+------+--------+------+-------------+
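After that, you can pivot the result into the desired output.
However, this only worked for me on a small sample; on the full table Postgres ran out of disk space (even though I had 50GB available):
ERROR: could not write to hash-join temporary file: No space left on device
I have posted this solution here because it may work for others with smaller tables.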
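The pivot step is not shown above. One way it might look is conditional aggregation over the lateral join, sketched below. The inlined sample data, the correlation on id, and the assumption that (id, type, "date") identifies a row uniquely are mine, not the original answer's.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH tab_A (id, type, "date") AS (
    VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
           (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
           (1, 'C', 24), (1, 'C', 22)
)
SELECT
    a.id, a.type, a."date",
    -- Each FILTER selects at most one joined row, so max() simply returns it.
    max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 1) AS "A_lag_1",
    max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 2) AS "A_lag_2",
    max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 1) AS "B_lag_1",
    max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 2) AS "B_lag_2",
    max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 1) AS "C_lag_1",
    max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 2) AS "C_lag_2"
FROM tab_A AS a
LEFT JOIN LATERAL (
    -- Earlier rows of the same id, ranked per type with the most recent first.
    SELECT
        b.type,
        b."date",
        row_number() OVER (PARTITION BY b.type ORDER BY b."date" DESC) AS nn_row
    FROM tab_A AS b
    WHERE b.id = a.id AND b."date" < a."date"
) AS c ON c.nn_row <= 2  -- in the ON clause, so rows of a without matches are kept
GROUP BY a.id, a.type, a."date"
ORDER BY a.id, a.type, a."date" DESC;
```

This only reshapes the result. The lateral subquery still ranks all earlier rows for every row of tab_A, so it is unlikely to help with the disk-space problem on a large table.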
Answer 2 (score: 0)
Since the FILTER clause works with aggregate functions, I decided to write my own.
----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
i integer;
stored_value float;
begin
i := agg_state[0];
stored_value := agg_state[1];
i := i + 1; -- First row i=1
if i = 1 then
stored_value := el;
end if;
return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
sfunc = lag_agg_sfunc_1,
stype = point,
finalfunc = lag_agg_ffunc_1,
initcond = '(0,0)'
);
----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
i integer;
stored_value float;
begin
i := agg_state[0];
stored_value := agg_state[1];
i := i + 1; -- First row i=1
if i = 2 then
stored_value := el;
end if;
return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
sfunc = lag_agg_sfunc_2,
stype = point,
finalfunc = lag_agg_ffunc_2,
initcond = '(0,0)'
);
You can use the aggregate functions lag_agg_1 and lag_agg_2 defined above with the window expression from the original question:
SELECT
id, type, "date",
NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
FROM tab_A
ORDER BY id ASC, type, "date" DESC
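This executes fairly quickly compared with the other options. Some things that could be improved:
Any help with the above would be much appreciated.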
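The same window frame can also be combined with the built-in array_agg, which is an aggregate and therefore accepts FILTER. The sketch below swaps that in for the custom lag_agg_N aggregates. It is an alternative technique, not the approach described above, and it has not been measured against it. The inlined sample data and the window name w are assumptions for illustration.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH tab_A (id, type, "date") AS (
    VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
           (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
           (1, 'C', 24), (1, 'C', 22)
)
SELECT
    id, type, "date",
    -- Collect the earlier dates of one type, then take the Nth array element.
    (array_agg("date") FILTER (WHERE type = 'A') OVER w)[1] AS "A_lag_1",
    (array_agg("date") FILTER (WHERE type = 'A') OVER w)[2] AS "A_lag_2",
    (array_agg("date") FILTER (WHERE type = 'B') OVER w)[1] AS "B_lag_1",
    (array_agg("date") FILTER (WHERE type = 'B') OVER w)[2] AS "B_lag_2",
    (array_agg("date") FILTER (WHERE type = 'C') OVER w)[1] AS "C_lag_1",
    (array_agg("date") FILTER (WHERE type = 'C') OVER w)[2] AS "C_lag_2"
FROM tab_A
WINDOW w AS (
    PARTITION BY id
    ORDER BY "date" DESC
    -- Only rows after the current one in this ordering, i.e. earlier dates.
    ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
)
ORDER BY id, type, "date" DESC;
```

An empty frame or an out-of-range subscript yields NULL, so no NULLIF wrapper is needed, and a further lag only requires a different subscript. The sketch relies on the aggregate receiving the frame rows in window order, and rows that share the same date within an id have no defined relative order.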