I have a table like the following:
select * from (
values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
('C', 7, 8), ('C', 8, 9)
) example_table("state", "start", "end")
-- example table:
-- state start stop
-- A t1 t2
-- A t2 t3
-- A t3 t4
-- B t4 t5
-- B t5 t6
-- A t6 t7
-- C t7 t8
-- C t8 t9
I would like to collapse the intervals by state, while still respecting the intervening state switches:
state start stop
A t1 t4
B t4 t6
A t6 t7
C t7 t9
Since a given state can appear in more than one row of the desired output, group by state alone will not work.
This looks like a use case for window functions, but I'm not sure what to partition the rows by.
I'd like to create an intermediate step with something like a group_id:
group state start stop
1 A t1 t2
1 A t2 t3
1 A t3 t4
2 B t4 t5
2 B t5 t6
3 A t6 t7
4 C t7 t8
4 C t8 t9
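Then I could group by group and select min(start) and max(stop), but I don't know how to create this variable (efficiently). In R, I would use the rle function to do this, but I don't know of any Presto equivalent.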
Answer 0 (score: 0)
This answer has been updated to reflect the following approach, which worked:
with example_table("state", "start", "end") as (
values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
('C', 7, 8), ('C', 8, 9)
), table_with_lags as (
-- detect state changes by observing the lagged value
select *, lag(state) over(order by start) as lag_state,
-- need to track the final value since it may be lost below
last_value("end") over(order by start rows between
0 preceding and unbounded following)
as end_period
from example_table
)
select state, start,
-- force-re-establish the start(+1) = end(0) link;
-- at the end of the period, override this with the
-- final observed value instead of null
lead(start, 1, end_period) over(order by start) as "end"
from table_with_lags
-- lag_state will be null for the first row
where state <> lag_state or lag_state is null
order by start
With output:
state start end
A 1 4
B 4 6
A 6 7
C 7 9
The "rows between 0 preceding and unbounded following" clause is a bit verbose, so you could also flip the logic and do:
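with example_table("state", "start", "end") as (
values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
('C', 7, 8), ('C', 8, 9)
), table_with_leads as (
-- detect state changes by observing the next row's value
select state, start, "end",
lead(state) over(order by start) as lead_state,
-- need to track the first value since it may be lost below
first_value(start) over(order by start) as start_period
from example_table
)
select state, lag("end", 1, start_period) over(order by start) as start, "end"
from table_with_leads
-- lead_state will be null for the last row
where state <> lead_state or lead_state is null
order by start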
The following works, but is not performant at scale (even on a 10% subsample of the data, I get an "exceeded local memory limit" error):
with switches as (
-- coalesce since the first row will be NULL, need it false
select *, coalesce(state <> lag(state) over(order by start), false) switched
from (
values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
('C', 7, 8), ('C', 8, 9)
) example_table("state", "start", "stop")
), groups as (
-- create the group ID as the accumulation of the state switches
-- since only one state switch can happen per group
select *, sum(cast(switched as bigint)) over (order by start) group_id
from switches
)
select min(state) state, min(start) start, max(stop) stop
from groups group by group_id order by start;
-- state start stop
-- A 1 4
-- B 4 6
-- A 6 7
-- C 7 9
I have currently found success by temporarily storing groups as a table, then separately selecting from groups, which appears to get around the RAM issue (somewhat surprisingly to me). This hardly seems ideal, but as it gets the job done, I'm content to stick with this for now.
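The two-step workaround described above might look roughly like the sketch below. It assumes a catalog and schema where CREATE TABLE AS is allowed, and state_groups is a hypothetical table name chosen for illustration (the original only says the groups CTE was stored as a table). The inline values stand in for the real data.

```sql
-- Step 1: materialize the group IDs (state_groups is a hypothetical name)
create table state_groups as
with switches as (
  -- coalesce since the first row will be NULL, need it false
  select *, coalesce(state <> lag(state) over(order by start), false) switched
  from (
    values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
           ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
           ('C', 7, 8), ('C', 8, 9)
  ) example_table("state", "start", "stop")
)
-- running count of state switches serves as the group ID
select *, sum(cast(switched as bigint)) over (order by start) group_id
from switches;

-- Step 2: aggregate from the stored table in a separate query
select min(state) state, min(start) start, max(stop) stop
from state_groups group by group_id order by start;

-- Step 3: clean up the temporary table
drop table state_groups;
```

Splitting the work this way means the window functions and the aggregation run as separate queries, which may be why the memory limit is no longer hit, though that explanation is a guess rather than something verified here.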