Grouping consecutive blocks for aggregation in SQL (Redshift)

Time: 2017-11-21 22:11:04

Tags: sql amazon-redshift

I have a table like this:

    id time activity
 1:  1    1        a
 2:  1    2        a
 3:  1    3        b
 4:  1    4        b
 5:  1    5        a
 6:  2    1        a
 7:  2    2        b
 8:  2    3        b
 9:  2    4        b
10:  2    5        a
11:  2    6        a
12:  2    7        c
13:  2    8        c
14:  2    9        c

Within each id, I want to aggregate over consecutive blocks of activity. Basically, I want a grouping column like this:

    id time activity grouping
 1:  1    1        a        1
 2:  1    2        a        1
 3:  1    3        b        2
 4:  1    4        b        2
 5:  1    5        a        3
 6:  2    1        a        1
 7:  2    2        b        2
 8:  2    3        b        2
 9:  2    4        b        2
10:  2    5        a        3
11:  2    6        a        3
12:  2    7        c        4
13:  2    8        c        4
14:  2    9        c        4

That way I can use aggregate functions and get something like this:

select id
, min(time) as min_time
, max(time) as max_time
, count(*) as n_activity
from A
group by id, grouping

   id min_time max_time n_activity
1:  1        1        2          2
2:  1        3        4          2
3:  1        5        5          1
4:  2        1        1          1
5:  2        2        4          3
6:  2        5        6          2
7:  2        7        9          3

How can I create the grouping column? My table is very large, so I would like to avoid cursors if at all possible.

Some sample data:

create table A (id int, time int, activity varchar);
insert into A (id, time, activity)
values
(1,1,'a'),(1,2,'a'),(1,3,'b'),(1,4,'b'),(1,5,'a'),(2,1,'a'),
(2,2,'b'),(2,3,'b'),(2,4,'b'),(2,5,'a'),(2,6,'a'),(2,7,'c'),
(2,8,'c'),(2,9,'c')

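2 answers:

Answer 0: (score: 3)

Use lag to check whether the previous row has the same activity as the current row. If it does not, the row starts a new block, and a running sum of those "new block" flags gives the group number.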

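-- Flag each row that starts a new block (1), then take a running sum of the flags.
-- Redshift requires an explicit frame clause when an aggregate window function has ORDER BY.
select t.*,
       sum(case when prev_activity = activity then 0 else 1 end)
           over (partition by id order by time rows unbounded preceding) as grp
from (
    select t.*, lag(activity) over (partition by id order by time) as prev_activity
    from tbl t
) t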
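
To make the lag-plus-running-sum idea concrete, here is a minimal sketch that runs it end to end on the sample data from the question and then applies the aggregation. It assumes an in-memory SQLite database (via Python's sqlite3 module, SQLite 3.25 or later for window functions) as a stand-in for Redshift, and it uses the question's table name A instead of tbl; both choices are for illustration only.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Redshift so the query can be run locally.
conn = sqlite3.connect(":memory:")
conn.execute("create table A (id int, time int, activity varchar)")
conn.executemany(
    "insert into A (id, time, activity) values (?, ?, ?)",
    [(1, 1, 'a'), (1, 2, 'a'), (1, 3, 'b'), (1, 4, 'b'), (1, 5, 'a'),
     (2, 1, 'a'), (2, 2, 'b'), (2, 3, 'b'), (2, 4, 'b'), (2, 5, 'a'),
     (2, 6, 'a'), (2, 7, 'c'), (2, 8, 'c'), (2, 9, 'c')],
)

query = """
select id,
       min(time) as min_time,
       max(time) as max_time,
       count(*)  as n_activity
from (
    -- running sum of "activity changed" flags = block number within each id
    select t.*,
           sum(case when prev_activity = activity then 0 else 1 end)
               over (partition by id order by time
                     rows unbounded preceding) as grp
    from (
        -- previous row's activity; NULL on the first row of each id
        select src.*,
               lag(activity) over (partition by id order by time) as prev_activity
        from A as src
    ) t
) g
group by id, grp
order by id, min_time
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, 1, 2, 2)
# (1, 3, 4, 2)
# (1, 5, 5, 1)
# (2, 1, 1, 1)
# (2, 2, 4, 3)
# (2, 5, 6, 2)
# (2, 7, 9, 3)
```

The first row of each id has a NULL prev_activity, so the comparison is not true and the row is counted as the start of block 1.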

Answer 1: (score: 1)

You should be able to do this with just ROW_NUMBER(), using the values in time as a second number sequence:

SELECT
  *,
  time - ROW_NUMBER() OVER (PARTITION BY id, activity
                                ORDER BY time        )   AS rownum
FROM
  yourTable

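The fields (id, activity, rownum) give you a composite key for your groups. Here rownum is the alias for time - ROW_NUMBER() in the query above: within a run of consecutive rows with the same activity, time and the row number both increase by 1, so their difference stays constant, and it changes as soon as another activity interrupts the run.

If you really need a single-field identifier, you can wrap DENSE_RANK() OVER (PARTITION BY id ORDER BY rownum, activity DESC) around it.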

    id time activity   rownum  (time-rownum) (composite key) (dense_rank)

 1:  1    1        a    1                  0         (1,a,0)       1
 2:  1    2        a    2                  0         (1,a,0)       1
 3:  1    3        b      1                2         (1,b,2)       2
 4:  1    4        b      2                2         (1,b,2)       2
 5:  1    5        a    3                  2         (1,a,2)       3

 6:  2    1        a    1                  0         (2,a,0)       1
 7:  2    2        b      1                1         (2,b,1)       2
 8:  2    3        b      2                1         (2,b,1)       2
 9:  2    4        b      3                1         (2,b,1)       2
10:  2    5        a    2                  3         (2,a,3)       3
11:  2    6        a    3                  3         (2,a,3)       3
12:  2    7        c        1              6         (2,c,6)       4
13:  2    8        c        2              6         (2,c,6)       4
14:  2    9        c        3              6         (2,c,6)       4
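
The DENSE_RANK() wrapping described above can be checked against the table with a short sketch. As an assumption for illustration, it uses an in-memory SQLite database through Python's sqlite3 module (SQLite 3.25 or later) in place of Redshift, with the question's sample data loaded into a table named yourTable.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Redshift so the query can be run locally.
conn = sqlite3.connect(":memory:")
conn.execute("create table yourTable (id int, time int, activity varchar)")
conn.executemany(
    "insert into yourTable (id, time, activity) values (?, ?, ?)",
    [(1, 1, 'a'), (1, 2, 'a'), (1, 3, 'b'), (1, 4, 'b'), (1, 5, 'a'),
     (2, 1, 'a'), (2, 2, 'b'), (2, 3, 'b'), (2, 4, 'b'), (2, 5, 'a'),
     (2, 6, 'a'), (2, 7, 'c'), (2, 8, 'c'), (2, 9, 'c')],
)

query = """
select id, time, activity,
       -- collapse the composite key (activity, rownum) into one number per id
       dense_rank() over (partition by id
                          order by rownum, activity desc) as grp
from (
    select *,
           time - row_number() over (partition by id, activity
                                     order by time) as rownum
    from yourTable
) keyed
order by id, time
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# grp per row, in (id, time) order: 1 1 2 2 3 | 1 2 2 2 3 3 4 4 4
```

On this sample the ranks happen to follow time order. That is not guaranteed in general: for the sequence a, a, b, a at times 1 to 4, the keys are 0, 0, 2, 1, so the final a block ranks before the b block. The ranks still identify each block uniquely within an id, which is all that GROUP BY needs.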

Applying the composite key to the aggregation example...

SELECT
    id
  , min(time) as min_time
  , max(time) as max_time
  , count(*) as n_activity
FROM
(
  SELECT
    *,
    time - ROW_NUMBER() OVER (PARTITION BY id, activity
                                  ORDER BY time        )   AS rownum
  FROM
    yourTable
)
  partitioned
GROUP BY
  id, activity, rownum

If time is ordered but not always consecutive, that becomes...

SELECT
    id
  , min(time) as min_time
  , max(time) as max_time
  , count(*) as n_activity
FROM
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id
                           ORDER BY time        )
    -
    ROW_NUMBER() OVER (PARTITION BY id, activity
                           ORDER BY time        )   AS rownum
  FROM
    yourTable
)
  partitioned
GROUP BY
  id, activity, rownum
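
To see why the second form matters, the sketch below runs both keys on a small made-up data set whose time values have gaps. The data, the helper name aggregate, and the use of in-memory SQLite through Python's sqlite3 module (SQLite 3.25 or later) instead of Redshift are all assumptions for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Redshift so the queries can be run locally.
conn = sqlite3.connect(":memory:")
conn.execute("create table yourTable (id int, time int, activity varchar)")
# Hypothetical data: three blocks (a, a), (b, b), (a) with non-consecutive times.
conn.executemany(
    "insert into yourTable (id, time, activity) values (?, ?, ?)",
    [(1, 1, 'a'), (1, 3, 'a'), (1, 4, 'b'), (1, 7, 'b'), (1, 10, 'a')],
)

def aggregate(key_expr):
    """Aggregate per (id, activity, key), where key_expr is the grouping key SQL."""
    sql = f"""
    select id,
           min(time) as min_time,
           max(time) as max_time,
           count(*)  as n_activity
    from (
        select *, {key_expr} as rownum
        from yourTable
    ) partitioned
    group by id, activity, rownum
    order by id, min_time
    """
    return conn.execute(sql).fetchall()

# Key based on time: only valid when time increases by exactly 1 per row.
time_key = "time - row_number() over (partition by id, activity order by time)"
# Key based on two row numbers: only the ordering of time matters.
rn_key = ("row_number() over (partition by id order by time) - "
          "row_number() over (partition by id, activity order by time)")

naive_rows = aggregate(time_key)
rows = aggregate(rn_key)

print(len(naive_rows))  # 5: every row ends up in its own group because of the gaps
print(rows)             # [(1, 1, 3, 2), (1, 4, 7, 2), (1, 10, 10, 1)]
```

Replacing time with a per-id row number restores the property the trick relies on: both sequences advance by exactly 1 within a block, regardless of how far apart the time values are.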