Collapsing window intervals

Time: 2017-12-15 07:54:40

Tags: mysql sql presto

I have data like the following:

select * from (
  values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
         ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
         ('C', 7, 8), ('C', 8, 9)
) example_table("state", "start", "end")

-- example table: 
-- state start stop
--     A    t1   t2
--     A    t2   t3
--     A    t3   t4
--     B    t4   t5
--     B    t5   t6
--     A    t6   t7
--     C    t7   t8
--     C    t8   t9

I would like to collapse the intervals by state, while still respecting intervening state switches:

state start stop
    A    t1   t4
    B    t4   t6
    A    t6   t7
    C    t7   t9

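Because the desired output has more than one row for a given state (A appears twice), a plain group by state will not work.

This looks like a use case for window functions, but I am not sure what to partition the rows by.

I would like to create an intermediate step with something like a group_id: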

group state start stop
    1     A    t1   t2
    1     A    t2   t3
    1     A    t3   t4
    2     B    t4   t5
    2     B    t5   t6
    3     A    t6   t7
    4     C    t7   t8
    4     C    t8   t9

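Then I could group by group and select min(start) and max(stop), but I don't know how to create this variable (efficiently). In R I would use the rle function for this, but I don't know of any Presto equivalent.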

1 answer:

Answer 0: (score: 0)

This answer has been updated to lead with the following approach, which succeeded:

with example_table("state", "start", "end") as (
  values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
         ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
         ('C', 7, 8), ('C', 8, 9)
), table_with_lags as (
  -- detect state changes by observing the lagged value
  select *, lag(state) over(order by start) as lag_state,
  -- need to track the final value since it may be lost below
  last_value("end") over(order by start rows between 
                         0 preceding and unbounded following)
  as end_period
  from example_table
)
select state, start, 
       -- force-re-establish the start(+1) = end(0) link;
       --   at the end of the period, override this with the
       --   final observed value instead of null
       lead(start, 1, end_period) over(order by start) as "end"
from table_with_lags
-- lag_state will be null for the first row
where state <> lag_state or lag_state is null
order by start

With output:

state start  end
    A     1    4
    B     4    6
    A     6    7
    C     7    9
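
To see why this works, it may help to look at the rows that survive the where filter before lead() is applied. Window functions in a select list are evaluated after where, so lead(start) only ever sees the first row of each run of equal states. The sketch below reuses the same sample data and CTE, writes the frame as current row instead of 0 preceding, and simply selects the intermediate columns. Note that the approach relies on the intervals being contiguous (each row's start equals the previous row's end), as they are in the sample data.

```sql
with example_table("state", "start", "end") as (
  values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
         ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
         ('C', 7, 8), ('C', 8, 9)
), table_with_lags as (
  select *, lag(state) over(order by start) as lag_state,
         last_value("end") over(order by start rows between
                                current row and unbounded following)
         as end_period
  from example_table
)
-- only the first row of each run passes the filter
select state, start, "end", lag_state, end_period
from table_with_lags
where state <> lag_state or lag_state is null
order by start

-- state start end lag_state end_period
--     A     1   2      NULL          9
--     B     4   5         A          9
--     A     6   7         B          9
--     C     7   8         A          9
```

Taking lead(start) over these four rows gives 4, 6, 7 and then the default end_period of 9, which are the collapsed end values.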

The rows between 0 preceding and unbounded following frame is a bit verbose, so you could also flip the logic: keep the last row of each run and recover its start from the previous run's end:

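with example_table("state", "start", "end") as (
  values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
         ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
         ('C', 7, 8), ('C', 8, 9)
), table_with_leads as (
  -- detect the last row of each run by observing the leading value
  select state, start, "end", 
         lead(state) over(order by start) as lead_state,
         -- track the first start, since the lag below is null for the first run
         first_value(start) over(order by start) as start_period
  from example_table
)
-- lag("end") is the previous run's end, which equals this run's start
select state, lag("end", 1, start_period) over(order by start) as start, "end"
from table_with_leads
-- lead_state will be null for the last row
where state <> lead_state or lead_state is null
order by start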

Original answer

The following works, but is not performant at scale (even on a 10% subsample of the data, I get an "exceeded local memory limit" error):

with switches as (
  -- coalesce since the first row will be NULL, need it false
  select *, coalesce(state <> lag(state) over(order by start), false) switched
  from (
    values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
           ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
           ('C', 7, 8), ('C', 8, 9)
  ) example_table("state", "start", "stop")
), groups as (
  -- create the group ID as the accumulation of the state switches
  --   since only one state switch can happen per group
  select *, sum(cast(switched as bigint)) over (order by start) group_id
  from switches
)
select min(state) state, min(start) start, max(stop) stop
from groups group by group_id order by start;

-- state start stop
--     A     1    4
--     B     4    6
--     A     6    7
--     C     7    9
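
For comparison, another common way to build a group key is the difference of two row numbers (often called the "gaps and islands" technique). This is a different technique from the cumulative sum above, shown only as a sketch on the same sample data; the names numbered and run_key are made up for illustration, and whether it behaves any better under the memory limit has not been tested here. Within a run both row numbers grow by one per row, so their difference stays constant; when the same state reappears later, the difference changes.

```sql
with example_table("state", "start", "stop") as (
  values ('A', 1, 2), ('A', 2, 3), ('A', 3, 4),
         ('B', 4, 5), ('B', 5, 6), ('A', 6, 7),
         ('C', 7, 8), ('C', 8, 9)
), numbered as (
  -- run_key is constant within each consecutive run of one state
  select *,
         row_number() over (order by start)
           - row_number() over (partition by state order by start) as run_key
  from example_table
)
-- run_key is only unique per state, so group by both
select state, min(start) as start, max(stop) as stop
from numbered
group by state, run_key
order by start
```

On the sample data run_key is 0 for the first A run, 3 for B, 2 for the second A run and 6 for C, so grouping by (state, run_key) separates the two A runs.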

I have currently found success by temporarily storing groups as a table, then separately selecting from groups, which appears to get around the RAM issue (somewhat surprisingly to me). This hardly seems ideal, but as it gets the job done, I'm content to stick with this for now.
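
A minimal sketch of that two-step workaround, assuming the source data lives in a real table called example_table with columns state, start and stop, and using a hypothetical scratch table name tmp_groups (the catalog and schema it lands in, and whether you are allowed to create tables there, depend on your setup):

```sql
-- Step 1: materialize the group IDs.
-- tmp_groups is a hypothetical name; example_table is assumed to be a real table.
create table tmp_groups as
select *, sum(cast(switched as bigint)) over (order by start) as group_id
from (
  select *, coalesce(state <> lag(state) over (order by start), false) as switched
  from example_table
) switches;

-- Step 2: aggregate from the stored table in a separate query.
select min(state) state, min(start) start, max(stop) stop
from tmp_groups group by group_id order by start;

-- Step 3: clean up the scratch table.
drop table tmp_groups;
```

Splitting the work this way means the window functions and the aggregation run as separate queries, which matches the behavior described above but is not a guaranteed fix for memory limits.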