Get the deltas of a custom aggregate with date ranges

Time: 2019-03-13 21:30:49

Tags: sql postgresql

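I need to find an efficient way to write a query that reports the deltas of an aggregate, along with the start and end dates for which each value holds.

Requirements

  • The source table contains a start date, an end date, a category ID, a sub-category ID, and an indicator of whether that sub-category is active.
  • The aggregate is over is_active per cat_id, and its result should be 1 as long as is_active is 1 for any sub_category.
  • If the aggregate yields the same result across consecutive date ranges, those ranges should be merged to reduce the result set.
  • A category/sub-category combination will never have overlapping dates, but different sub-categories may cross each other's boundaries.

What I've tried

I tried creating a CTE that generates all possible ranges for a category and then joining it back to the main query, so that sub-categories spanning several ranges are broken apart. I then grouped by range and took MAX(is_active).

While this was a good start (all that remained was to merge consecutive ranges with the same value), the query was extremely slow. I am less familiar with Postgres than with other flavors of SQL, so I decided my time was better spent reaching out and getting help from someone more experienced.

Source data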

+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt   | end_dt     | cat_id | sub_cat_id | is_active | comment                                             |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1  | 2018-01-01 | 2018-01-31 | 1      | 1001       | 1         | (null)                                              |
| 2  | 2018-02-01 | 2018-02-14 | 1      | 1001       | 0         | (null)                                              |
| 3  | 2018-02-15 | 2018-02-28 | 1      | 1001       | 0         | cat 1 is_active is unchanged despite new record.    |
| 4  | 2018-03-01 | 2018-03-30 | 1      | 1001       | 1         | (null)                                              |
| 5  | 2018-01-01 | 2018-01-15 | 2      | 2001       | 1         | (null)                                              |
| 6  | 2018-01-01 | 2018-01-31 | 2      | 2002       | 1         | (null)                                              |
| 7  | 2018-01-15 | 2018-02-10 | 2      | 2001       | 0         | cat 2 should still be active until 2002 is inactive |
| 8  | 2018-02-01 | 2018-02-14 | 2      | 2002       | 0         | cat 2 is inactive                                   |
| 9  | 2018-02-10 | 2018-03-15 | 2      | 2001       | 0         | this record will cause trouble                      |
| 10 | 2018-02-15 | 2018-03-30 | 2      | 2002       | 1         | cat 2 should be active again                        |
| 11 | 2018-03-15 | 2018-03-30 | 2      | 2001       | 1         | cat 2 is_active is unchanged despite new record.    |
| 12 | 2018-04-01 | 2018-04-30 | 2      | 2001       | 0         | cat 2 ends in a zero                                |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+

Expected results

+------------+------------+--------+-----------+
| start_dt   | end_dt     | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1      | 1         |
| 2018-02-01 | 2018-02-28 | 1      | 0         |
| 2018-03-01 | 2018-03-30 | 1      | 1         |
| 2018-01-01 | 2018-01-31 | 2      | 1         |
| 2018-02-01 | 2018-02-14 | 2      | 0         |
| 2018-02-15 | 2018-03-30 | 2      | 1         |
| 2018-04-01 | 2018-04-30 | 2      | 0         |
+------------+------------+--------+-----------+

Here is a select statement to help you write your own tests.

SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES 
    (1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
    (2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
    (3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
    (4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
    (5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
    (6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
    (7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
    (8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
    (9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
    (10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active agai'),
    (11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
    (12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')

) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )

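2 answers:

Answer 0 (score: 1)

So, a category is active on a given date if at least one of its sub-categories is active on that date. If no sub-category is active on a given date, the date is inactive. This logic was not clear to me at first from the original question.


I had mentioned Itzik Ben-Gan's article Packing Intervals, which shows one way to approach this.

With that approach you pack all the active intervals and ignore the inactive intervals completely. The gaps that remain after packing the active intervals are inactive.

If you never have dates that are neither active nor inactive, this is the final answer. If such "undefined" dates can exist, things may get tricky.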
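
As a rough illustration of packing only the active intervals, here is a minimal sketch. It uses a running maximum of the previous end dates, which is one common way to pack intervals and not necessarily the exact formulation from the article. The CTE names (active, flagged, grouped) and the is_new and grp columns are made up for this example; the data is just the is_active = 1 rows of category 2 from the question.

```sql
-- Assumption: only the active (is_active = 1) rows of category 2 are packed.
WITH active(cat_id, start_dt, end_dt) AS (
    VALUES (2, DATE '2018-01-01', DATE '2018-01-15'),
           (2, DATE '2018-01-01', DATE '2018-01-31'),
           (2, DATE '2018-02-15', DATE '2018-03-30'),
           (2, DATE '2018-03-15', DATE '2018-03-30')
)
, flagged AS (
    SELECT *,
           -- is_new = 1 when the row starts a new packed interval, i.e. it
           -- begins more than one day after every earlier interval has ended
           -- (the first row has no earlier interval, so it is also flagged).
           CASE WHEN start_dt <= MAX(end_dt) OVER (
                         PARTITION BY cat_id
                         ORDER BY start_dt, end_dt
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + 1
                THEN 0 ELSE 1 END AS is_new
    FROM active
)
, grouped AS (
    -- A running sum of the flags numbers the packed intervals.
    SELECT *,
           SUM(is_new) OVER (PARTITION BY cat_id ORDER BY start_dt, end_dt) AS grp
    FROM flagged
)
SELECT cat_id, MIN(start_dt) AS start_dt, MAX(end_dt) AS end_dt
FROM grouped
GROUP BY cat_id, grp
ORDER BY cat_id, start_dt;
-- Result for this sample:
--   2 | 2018-01-01 | 2018-01-31
--   2 | 2018-02-15 | 2018-03-30
```

The two packed rows match the is_active = 1 rows expected for category 2. The inactive rows would still have to be derived from the gaps between them.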


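A completely different approach is to use a calendar table (either a permanent table or a series of dates generated on the fly). Join each row of the original table to the calendar table to expand it into one row per date in the given interval.

Then group everything by category and date and take MAX(is_active) as the flag (if at least one sub-category has is_active = 1 on that date, the MAX is also 1, i.e. active).

This approach is easier to understand and should work reasonably well if the intervals are not too long.

Something like this: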

SELECT
    Calendar.dt
    ,src.cat_id
    ,MAX(src.is_active) AS is_active
    -- we don't even need to know sub_cat_id
FROM
    src
    INNER JOIN Calendar
        ON  Calendar.dt >= src.start_dt
        AND Calendar.dt <= src.end_dt
GROUP BY
    Calendar.dt
    ,src.cat_id

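This gives you one row per date and category. Now you need to merge consecutive dates back into intervals. You can use the "packing intervals" approach again, or a simpler gaps-and-islands variation.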
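
To make the gaps-and-islands step more concrete, here is a small sketch of the row-number trick on made-up per-date flags for a single category. The daily CTE and its five rows are hypothetical; rn1 and rn2 mirror the column names used in the main query further down.

```sql
-- Hypothetical per-date flags for one category.
WITH daily(dt, cat_id, is_active) AS (
    VALUES (DATE '2018-01-30', 1, 1),
           (DATE '2018-01-31', 1, 1),
           (DATE '2018-02-01', 1, 0),
           (DATE '2018-02-02', 1, 0),
           (DATE '2018-02-03', 1, 1)
)
SELECT dt, is_active,
       -- position of the row within the whole category
       ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1,
       -- position of the row among rows with the same flag
       ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2,
       -- the difference stays constant while the flag does not change
       ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt)
     - ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS grp
FROM daily
ORDER BY dt;
--     dt     | is_active | rn1 | rn2 | grp
-- 2018-01-30 |     1     |  1  |  1  |  0
-- 2018-01-31 |     1     |  2  |  2  |  0
-- 2018-02-01 |     0     |  3  |  1  |  2
-- 2018-02-02 |     0     |  4  |  2  |  2
-- 2018-02-03 |     1     |  5  |  3  |  2
```

The last two islands share grp = 2 but differ in is_active, which is why the final GROUP BY needs cat_id, is_active and the difference together. Note also that ROW_NUMBER() looks only at row order, not at the dates themselves, so two rows with the same flag would be merged even if there were a calendar gap between them.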

Sample data

WITH src AS
(
    SELECT id,start_dt::date start_dt,end_dt::date end_dt,cat_id,sub_cat_id,is_active,comment
    FROM (VALUES 
        (1,  '2018-01-01', '2018-01-31', 1, 1001, 1, null),
        (2,  '2018-02-01', '2018-02-14', 1, 1001, 0, null),
        (3,  '2018-02-15', '2018-02-28', 1, 1001, 0, 'cat 1 is_active is unchanged despite new record.'),
        (4,  '2018-03-01', '2018-03-30', 1, 1001, 1, null),
        (5,  '2018-01-01', '2018-01-15', 2, 2001, 1, null),
        (6,  '2018-01-01', '2018-01-31', 2, 2002, 1, null),
        (7,  '2018-01-15', '2018-02-10', 2, 2001, 0, 'cat 2 should still be active until 2002 is inactive'),
        (8,  '2018-02-01', '2018-02-14', 2, 2002, 0, 'cat 2 is inactive'),
        (9,  '2018-02-10', '2018-03-15', 2, 2001, 0, 'cat 2 is_active is unchanged despite new record.'),
        (10, '2018-02-15', '2018-03-30', 2, 2002, 1, 'cat 2 should be active agai'),
        (11, '2018-03-15', '2018-03-30', 2, 2001, 1, 'cat 2 is_active is unchanged despite new record.'),
        (12, '2018-04-01', '2018-04-30', 2, 2001, 0, 'cat 2 ends in 0.')
    ) src ( id,start_dt,end_dt,cat_id,sub_cat_id,is_active,comment)
)
,Calendar AS
(
    -- OP Note: Union of all dates from source produced 30% faster results.
    -- OP Note 2: Including the cat_id (which was indexed FK), Made Query 8x faster.
    SELECT cat_id, start_dt dt FROM src
    UNION SELECT cat_id, end_dt dt FROM src 
    /*SELECT dt::date dt
    FROM (
        SELECT MIN(start_dt) min_start, MAX(end_dt) max_end
        FROM src
    ) max_ranges
    CROSS JOIN generate_series(min_start, max_end, '1 day'::interval) dt*/
)

Main query

Examine the results of each intermediate CTE to fully understand how it works.

-- expand intervals into individual dates
,CTE_Dates
AS
(
    SELECT
        Calendar.dt
        ,src.cat_id
        ,MAX(src.is_active) AS is_active
        -- we don't even need to know sub_cat_id
    FROM
        src
        INNER JOIN Calendar
            ON  Calendar.dt >= src.start_dt
            AND Calendar.dt <= src.end_dt
            AND Calendar.cat_id = src.cat_id
    GROUP BY
        Calendar.dt
        ,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
    SELECT
        *
        ,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
    FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
    MIN(dt) AS start_dt
    ,MAX(dt) AS end_dt
    ,cat_id
    ,is_active
FROM CTE_rn
GROUP BY
    cat_id
    ,is_active
    ,rn1 - rn2
ORDER BY
    cat_id
    ,start_dt
;

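Second variant without a generic calendar

It is likely to perform better, because this variant does not have to scan the src table (twice) to build a temporary list of dates and sort that list to remove duplicates, and there is no join to that temporary list of dates, which most likely has no supporting index. However, it generates more rows.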

-- remove Calendar CTE above, 
-- use generate_series() to generate the exact range of dates we need 
-- without joining to generic Calendar table

-- expand intervals into individual dates
,CTE_Dates
AS
(
    SELECT
        Dates.dt
        ,src.cat_id
        ,MAX(src.is_active) AS is_active
        -- we don't even need to know sub_cat_id
    FROM
        src
        INNER JOIN LATERAL
        (
            SELECT dt::date
            FROM generate_series(src.start_dt, src.end_dt, '1 day'::interval) AS s(dt)
        ) AS Dates ON true
    GROUP BY
        Dates.dt
        ,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
    SELECT
        *
        ,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
    FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
    MIN(dt) AS start_dt
    ,MAX(dt) AS end_dt
    ,cat_id
    ,is_active
FROM CTE_rn
GROUP BY
    cat_id
    ,is_active
    ,rn1 - rn2
ORDER BY
    cat_id
    ,start_dt
;

Result

+------------+------------+--------+-----------+
|  start_dt  |   end_dt   | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 |      1 |         1 |
| 2018-02-01 | 2018-02-28 |      1 |         0 |
| 2018-03-01 | 2018-03-30 |      1 |         1 |
| 2018-01-01 | 2018-01-31 |      2 |         1 |
| 2018-02-01 | 2018-02-14 |      2 |         0 |
| 2018-02-15 | 2018-03-30 |      2 |         1 |
| 2018-04-01 | 2018-04-30 |      2 |         0 |
+------------+------------+--------+-----------+

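Also, CTEs are known to be an "optimization fence" in Postgres, so performance may change if you inline these CTEs into a single query. (From PostgreSQL 12 on, a non-recursive, side-effect-free CTE that is referenced only once can be inlined automatically.) You need to test on your system with your data.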

Answer 1 (score: 1)

WITH test AS (
    SELECT id, start_dt::date, end_dt::date, cat_id, sub_cat_id, is_active::int, comment  FROM ( VALUES 
        (1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
        (2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
        (3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
        (4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
        (5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
        (6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
        (7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
        (8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
        (9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
        (10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active agai'),
        (11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
        (12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
        ) test (id, start_dt, end_dt, cat_id, sub_cat_id, is_active, comment) 
    )
SELECT cat_id, start_date, end_date, active_state
FROM (
    SELECT cat_id, date as start_date, lead(date-1) over w as end_date
        , active_state, prev_active
        , nonactive_state, prev_nonactive
    FROM (
        SELECT cat_id, date 
            , active_state, prev_active
            , nonactive_state
            , lag(nonactive_state, 1, 0) over w as prev_nonactive
        FROM (
            SELECT cat_id, date, active_state, lag(active_state, 1, 0) over w as prev_active
                , (nonactive_state > active_state)::int as nonactive_state
            FROM (
                SELECT DISTINCT ON (cat_id, date)
                    cat_id, date
                    , (CASE WHEN sum(type) over w > 0 THEN 1 ELSE 0 END) as active_state
                    , (CASE WHEN sum(nonactive_type) over w > 0 THEN 1 ELSE 0 END) as nonactive_state
                FROM (
                    SELECT start_dt as date
                        , 1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                  UNION ALL
                    SELECT end_dt + 1 as date
                        , -1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                  UNION ALL
                    SELECT start_dt as date
                        , 0 as type
                        , cat_id
                        , 1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                  UNION ALL
                    SELECT end_dt + 1 as date
                        , 0 as type
                        , cat_id
                        , -1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                ) t
                WINDOW w as (partition by cat_id order by date)
                ORDER BY cat_id, date
            ) t2
            WINDOW w as (partition by cat_id order by date)
        ) t3
        WINDOW w as (partition by cat_id order by date)
    ) t4
    WHERE (active_state != prev_active) OR (nonactive_state != prev_nonactive)
    WINDOW w as (partition by cat_id order by date)
    ) t5
WHERE active_state = 1 OR nonactive_state = 1
ORDER BY cat_id, start_date

yields

| cat_id | start_date |   end_date | active_state |
|--------+------------+------------+--------------|
|      1 | 2018-01-01 | 2018-01-31 |            1 |
|      1 | 2018-02-01 | 2018-02-28 |            0 |
|      1 | 2018-03-01 | 2018-03-30 |            1 |
|      2 | 2018-01-01 | 2018-01-31 |            1 |
|      2 | 2018-02-01 | 2018-02-14 |            0 |
|      2 | 2018-02-15 | 2018-03-30 |            1 |
|      2 | 2018-04-01 | 2018-04-30 |            0 |

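This merges the start_dt and end_dt dates into a single column and introduces a type column that is 1 for start dates and -1 for end dates. Summing over type yields a positive value when the corresponding date falls within a [start_dt, end_dt] interval, and 0 otherwise.

This is one of the ideas presented in Itzik Ben-Gan's Packing Intervals, but I first learned it from DSM (in the context of Python/Pandas programming).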
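
To see the running sum in isolation, here is a minimal sketch on two made-up overlapping intervals. The intervals and events CTEs and the open_intervals column are names chosen for this illustration, and dt plays the role of the date column in the query above.

```sql
-- Two hypothetical intervals that overlap from 2018-01-10 to 2018-01-15.
WITH intervals(start_dt, end_dt) AS (
    VALUES (DATE '2018-01-01', DATE '2018-01-15'),
           (DATE '2018-01-10', DATE '2018-01-31')
)
, events AS (
    -- +1 on the day an interval opens
    SELECT start_dt AS dt, 1 AS type FROM intervals
    UNION ALL
    -- -1 on the first day after it closes
    SELECT end_dt + 1 AS dt, -1 AS type FROM intervals
)
SELECT DISTINCT dt,
       -- running count of intervals that are open from dt onward
       SUM(type) OVER (ORDER BY dt) AS open_intervals
FROM events
ORDER BY dt;
--     dt     | open_intervals
-- 2018-01-01 |       1
-- 2018-01-10 |       2
-- 2018-01-16 |       1
-- 2018-02-01 |       0
```

The count is positive from 2018-01-01 through 2018-01-31 and drops to 0 on 2018-02-01, so the two intervals pack into a single one.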


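Usually, when intervals are handled with the technique above, the intervals define when a date is "on", and not being "on" automatically means "off". In this problem, however, rows with is_active = 1 mark intervals that are "on", but dates outside those intervals are not necessarily "off". 2018-03-31 is an example of a date that lies outside every is_active = 1 interval yet is not "off". Similarly, rows with is_active = 0 mark a date as "off" only as long as that date does not intersect an interval with is_active = 1.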

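To handle these two kinds of intervals, I applied the technique above (summing the +1/-1 type) twice: once for rows with is_active = 1, and once for rows with is_active = 0. This lets us determine the dates that are definitely in active_state ("on") and the dates that are definitely in nonactive_state ("off"). Since active trumps nonactive, the dates considered nonactive can be trimmed with:

(nonactive_state > active_state)::int as nonactive_state

(That is, when active_state = 1 and nonactive_state = 1, the assignment above changes nonactive_state to 0.)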
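
As a quick check of that expression, the sketch below evaluates it for all four combinations of the two flags. The inline VALUES list and the raw_nonactive and trimmed_nonactive column names are only for illustration.

```sql
-- Truth table for the trimming expression.
SELECT active_state,
       nonactive_state AS raw_nonactive,
       (nonactive_state > active_state)::int AS trimmed_nonactive
FROM (VALUES (0, 0), (0, 1), (1, 0), (1, 1)) AS s(active_state, nonactive_state)
ORDER BY 1, 2;
-- active_state | raw_nonactive | trimmed_nonactive
--      0       |       0       |         0
--      0       |       1       |         1
--      1       |       0       |         0
--      1       |       1       |         0
```

Only the (0, 1) combination stays nonactive; every date that is active, whether or not an inactive row also covers it, ends up with a nonactive flag of 0.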