Min and Max on grouped time sequences in SQL

Date: 2018-09-20 06:29:52

Tags: sql postgresql window-functions

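I have a large Postgres table, test, from which I want to extract the consecutive sequences of no_signal states for each mobile_id; in other words, the length of time that individual mobile devices are out of service.

In the real table the records are not in order, which I think means that, in addition to a window function, the query must include a PARTITION OVER (time, mobile_id) statement. Any advice on how to create a group for each consecutive sequence and then take the minimum and maximum of each group would be much appreciated.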

Table definition and sample data:

-- CREATE TABLE test (mobile_id int, state varchar, time timestamp, region varchar)

INSERT INTO test (mobile_id, state, time, region ) VALUES
(1, 'active', TIMESTAMP '2018-08-09 15:00:00', 'EU'),  
(1, 'active', TIMESTAMP '2018-08-09 16:00:00', 'EU'),
(1, 'no_signal', TIMESTAMP '2018-08-09 17:00:00', 'EU'),
(1, 'no_signal', TIMESTAMP '2018-08-09 18:00:00', 'EU'),
(1, 'no_signal', TIMESTAMP '2018-08-09 19:00:00', 'EU'),
(1, 'active', TIMESTAMP '2018-08-09 20:00:00', 'EU'),
(1, 'inactive', TIMESTAMP '2018-08-09 21:00:00', 'EU'),
(1, 'active', TIMESTAMP '2018-08-09 22:00:00', 'EU'),
(1, 'active', TIMESTAMP '2018-08-09 23:00:00', 'EU'),
(2, 'active', TIMESTAMP '2018-08-10 00:00:00', 'EU'),
(2, 'no_signal', TIMESTAMP '2018-08-10 01:00:00', 'EU'),
(2, 'active', TIMESTAMP '2018-08-10 02:00:00', 'EU'),
(2, 'no_signal', TIMESTAMP '2018-08-10 03:00:00', 'EU'),
(2, 'no_signal', TIMESTAMP '2018-08-10 04:00:00', 'EU'),
(2, 'no_signal', TIMESTAMP '2018-08-10 05:00:00', 'EU'),
(2, 'no_signal', TIMESTAMP '2018-08-10 06:00:00', 'EU'),
(3, 'active', TIMESTAMP '2018-08-10 07:00:00', 'SA'),
(3, 'active', TIMESTAMP '2018-08-10 08:00:00', 'SA'),
(3, 'no_signal', TIMESTAMP '2018-08-10 09:00:00', 'SA'),
(3, 'no_signal', TIMESTAMP '2018-08-10 10:00:00', 'SA'),
(3, 'inactive', TIMESTAMP '2018-08-10 11:00:00', 'SA'),
(3, 'inactive', TIMESTAMP '2018-08-10 12:00:00', 'SA'),
(3, 'no_signal', TIMESTAMP '2018-08-10 13:00:00', 'SA')

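My desired output would be something like the following; my own attempt did not produce it because the groups were not created correctly: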

 mobile_id          start_time            end_time diff_time region
         1 2018-08-09 17:00:00 2018-08-09 19:00:00       120     EU
         2 2018-08-10 01:00:00 2018-08-10 01:00:00         0     EU
         2 2018-08-10 03:00:00 2018-08-10 06:00:00       180     EU
         3 2018-08-10 09:00:00 2018-08-10 10:00:00        60     SA
         3 2018-08-10 13:00:00 2018-08-10 13:00:00         0     SA

2 Answers:

Answer 0 (score: 1)

demo: db<>fiddle

SELECT DISTINCT
    mobile_id,
    first_value(time) over (partition by ranked order by time) as start_time,        -- B
    first_value(time) over (partition by ranked order by time desc) as end_time, 
    region
FROM
(
    SELECT *, SUM(is_diff) OVER (ORDER BY time) as ranked                          -- A
    FROM
    (
        SELECT *,
            CASE WHEN state = lag(state) over (order by time) THEN 0 ELSE 1 END as is_diff
        FROM test 
    ) s
) s
WHERE
    state = 'no_signal';

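A: The problem is that you are trying to order by one column while partitioning by another. This subquery solves that. The issue has been discussed elsewhere. I am still looking for a better solution, but this subquery works. It creates a column that can be used for the required window.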

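B: Once the window exists, start_time and end_time are easy to calculate with the first_value(time) function. For end_time, use first_value(time) ... ORDER BY time DESC: the DESC sorts the window with the latest time first, so the first value of that window is the end of the sequence (last_value() does not always work as expected).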
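The remark about last_value() in point B comes down to the default window frame: when a window has an ORDER BY but no explicit frame, the frame ends at the current row, so last_value() simply returns the current row's value. The following is a minimal sketch of that behaviour, not part of the original answer. It assumes Python's built-in sqlite3 module as a stand-in for Postgres (both use the same default frame) and reuses four rows of the sample data; the column aliases naive_end, desc_end and framed_end are made up for illustration.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Postgres in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (mobile_id int, state text, time text)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (2, "no_signal", "2018-08-10 03:00:00"),
        (2, "no_signal", "2018-08-10 04:00:00"),
        (2, "no_signal", "2018-08-10 05:00:00"),
        (2, "no_signal", "2018-08-10 06:00:00"),
    ],
)

QUERY = """
SELECT
    time,
    -- default frame ends at the current row, so this echoes `time`
    last_value(time) OVER (PARTITION BY mobile_id ORDER BY time) AS naive_end,
    -- the descending trick from point B
    first_value(time) OVER (PARTITION BY mobile_id ORDER BY time DESC) AS desc_end,
    -- explicit frame that spans the whole partition
    last_value(time) OVER (
        PARTITION BY mobile_id ORDER BY time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS framed_end
FROM test
ORDER BY time
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Both desc_end and framed_end give the end of the sequence on every row, so either form could be used for end_time.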


To keep the focus on the actual problem, I left out the diff calculation above. To add diff, you only need one more outer query:

SELECT 
    *,  
    EXTRACT(epoch from (end_time - start_time)) / 60 as diff
FROM (
    -- <QUERY ABOVE>
) s
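The query above runs lag and the running sum over the whole table ordered by time, which fits the sample data because the devices do not overlap in time. As a hedged sketch of how the same idea might be applied when rows from different devices interleave, the example below adds PARTITION BY mobile_id to both windows and aggregates with GROUP BY instead of DISTINCT and first_value. It is not the answer's own code: it uses Python's sqlite3 module as a stand-in for Postgres, the six rows are made up for illustration, and strftime('%s', ...) replaces EXTRACT(epoch FROM ...), which SQLite does not have.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Postgres in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (mobile_id int, state text, time text)")
# Made-up rows in which two devices interleave in time.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, "no_signal", "2018-08-09 10:00:00"),
        (2, "no_signal", "2018-08-09 10:30:00"),
        (1, "no_signal", "2018-08-09 11:00:00"),
        (2, "active", "2018-08-09 11:30:00"),
        (1, "active", "2018-08-09 12:00:00"),
        (2, "no_signal", "2018-08-09 12:30:00"),
    ],
)

QUERY = """
SELECT
    mobile_id,
    MIN(time) AS start_time,
    MAX(time) AS end_time,
    -- SQLite substitute for EXTRACT(epoch FROM end - start) / 60
    (strftime('%s', MAX(time)) - strftime('%s', MIN(time))) / 60 AS diff_time
FROM (
    -- running sum of state changes, restarted for every device
    SELECT *, SUM(is_diff) OVER (PARTITION BY mobile_id ORDER BY time) AS ranked
    FROM (
        -- 1 where the state differs from the device's previous row, else 0
        SELECT *,
            CASE WHEN state = lag(state) OVER (PARTITION BY mobile_id ORDER BY time)
                 THEN 0 ELSE 1 END AS is_diff
        FROM test
    ) s
) s
WHERE state = 'no_signal'
GROUP BY mobile_id, ranked
ORDER BY mobile_id, start_time
"""

sequences = conn.execute(QUERY).fetchall()
for row in sequences:
    print(row)
```

With a single global ORDER BY time, the first three rows would share one running-sum value, because the state does not change between them even though they belong to two devices.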

Answer 1 (score: 1)

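This is a variant of the gaps-and-islands problem. In this case, you are trying to detect islands of records with no_signal for each mobile_id.

This answer uses the "difference in row numbers" method. The trick is to apply ROW_NUMBER to the table in two ways. The first generates a sequence over all records in time order; the second generates a sequence per mobile_id, restricted to the records whose state is no_signal. The difference between these two row numbers can be used to form each island. Then we only need to aggregate and take the min/max timestamp values to get the desired result.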
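To make the row-number difference concrete, here is a small sketch that prints both counters and their difference for the rows of mobile_id 2 from the sample data. It is a variant rather than the query of this answer: both row numbers are computed in one pass, with rn1 partitioned by mobile_id and rn2 partitioned by mobile_id and state, and Python's sqlite3 module is assumed as a stand-in for Postgres. The island alias is made up for illustration.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Postgres in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (mobile_id int, state text, time text)")
# Rows of mobile_id 2 from the question's sample data.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (2, "active", "2018-08-10 00:00:00"),
        (2, "no_signal", "2018-08-10 01:00:00"),
        (2, "active", "2018-08-10 02:00:00"),
        (2, "no_signal", "2018-08-10 03:00:00"),
        (2, "no_signal", "2018-08-10 04:00:00"),
        (2, "no_signal", "2018-08-10 05:00:00"),
        (2, "no_signal", "2018-08-10 06:00:00"),
    ],
)

QUERY = """
SELECT time, rn1, rn2, rn1 - rn2 AS island
FROM (
    SELECT
        state,
        time,
        -- position among all rows of the device
        ROW_NUMBER() OVER (PARTITION BY mobile_id ORDER BY time) AS rn1,
        -- position among the device's rows that share the same state
        ROW_NUMBER() OVER (PARTITION BY mobile_id, state ORDER BY time) AS rn2
    FROM test
) numbered
WHERE state = 'no_signal'
ORDER BY time
"""

island_rows = conn.execute(QUERY).fetchall()
for row in island_rows:
    print(row)
```

Within a run of no_signal rows both counters advance together, so the difference stays constant; an intervening row with another state advances only rn1 and therefore starts a new value. Grouping by mobile_id and that difference yields one group per island.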

WITH cte1 AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY time) rn1
    FROM test
),
cte2 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY mobile_id ORDER BY time) rn2
    FROM test
    WHERE state = 'no_signal'
),
cte3 AS (
    SELECT t1.*, t2.rn2
    FROM cte1 t1
    LEFT JOIN cte2 t2
        ON t1.mobile_id = t2.mobile_id AND t1.time = t2.time
    WHERE t1.state = 'no_signal'
)

SELECT
    mobile_id,
    MIN(time) AS start_time,
    MAX(time) AS end_time,
    EXTRACT(epoch FROM MAX(time::timestamp) - MIN(time::timestamp)) / 60 diff_time,
    region
FROM cte3
GROUP BY
    mobile_id,
    region,
    (rn1 - rn2)
ORDER BY
    mobile_id,
    start_time;

Demo