Collapsing multiple rows with consecutive timestamps

Date: 2012-06-25 09:22:02

Tags: sql postgresql

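I have a table that records product usage, somewhat like a log. Each period of product use is recorded as several timestamps, and I would like to represent the same data as time ranges.

It looks like this (PostgreSQL 9.1):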

userid | timestamp          | product
-------------------------------------
001    | 2012-04-23 9:12:05 | foo
001    | 2012-04-23 9:12:07 | foo
001    | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | barbaz
001    | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | barbaz
001    | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | foo
002    | 2012-04-24 3:41:03 | foo

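I would like to collapse the rows whose time difference from the previous row is at most a delta (e.g. 2 seconds), and get the begin time and the end time of each run, like this: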

userid | begin              | end                | product
----------------------------------------------------------
001    | 2012-04-23 9:12:05 | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | 2012-04-24 3:41:03 | foo

Note that when two usages of the same product are more than delta apart (2 seconds in this example), they are split into two rows.

create table t (userid int, timestamp timestamp, product text);

insert into t (userid, timestamp, product) values 
(001, '2012-04-23 9:12:05', 'foo'),
(001, '2012-04-23 9:12:07', 'foo'),
(001, '2012-04-23 9:12:09', 'foo'),
(001, '2012-04-23 9:12:11', 'barbaz'),
(001, '2012-04-23 9:12:13', 'barbaz'),
(001, '2012-04-23 9:15:00', 'barbaz'),
(001, '2012-04-23 9:15:01', 'barbaz'),
(002, '2012-04-24 3:41:01', 'foo'),
(002, '2012-04-24 3:41:03', 'foo')
;

1 answer:

Answer 0: (score: 9)

Inspired by this answer, posted a while back by @a_horse_with_no_name.

WITH groupped_t AS (
SELECT *, sum(grp_id) OVER (ORDER BY userid,product,"timestamp") AS grp_nr
  FROM (SELECT t.*,
          lag("timestamp") OVER
           (PARTITION BY userid,product ORDER BY "timestamp") AS prev_ts,
          CASE WHEN ("timestamp" - lag("timestamp") OVER
            (PARTITION BY userid,product ORDER BY "timestamp")) <= '2s'::interval
          THEN NULL ELSE 1 END AS grp_id
        FROM t) AS g
), periods AS (
SELECT min(gt."timestamp") AS grp_min, max(gt."timestamp") AS grp_max, grp_nr
  FROM groupped_t AS gt
 GROUP BY gt.grp_nr
)
SELECT gt.userid, p.grp_min AS "begin", p.grp_max AS "end", gt.product
  FROM periods p
  JOIN groupped_t gt ON gt.grp_nr = p.grp_nr AND gt."timestamp" = p.grp_min
 ORDER BY gt.userid, p.grp_min;
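  1. The innermost query assigns grouping IDs based on userid, product and the time difference. I think it is safe to PARTITION BY the first two fields.
  2. groupped_t gives me all the source columns plus an extra running group number. I use only ORDER BY in the window for sum() here, because I need the group IDs to be unique across the whole table.
  3. periods is just a helper query that finds the first and last timestamp of each group.
  4. Finally, I join groupped_t with periods on grp_nr (this is why it has to be unique) and on the timestamp of the first entry in each group.
  5. You can also check this query on SQL Fiddle.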
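
The steps above can also be sketched without the final join. This is a variant under stated assumptions, not the query from the answer: the names is_start and run_nr are made up for illustration, and the running sum is partitioned by userid and product, so run_nr is only unique within each user/product pair and the outer GROUP BY has to include those two columns.

```sql
-- Sketch: flag the first row of every run, number the runs with a
-- running sum of the flags, then aggregate each run into one row.
-- is_start and run_nr are hypothetical names, not part of the original query.
SELECT userid,
       min("timestamp") AS "begin",
       max("timestamp") AS "end",
       product
  FROM (SELECT flagged.*,
               -- running total of the start flags: constant within one run
               sum(is_start) OVER (PARTITION BY userid, product
                                   ORDER BY "timestamp") AS run_nr
          FROM (SELECT t.*,
                       -- 0 when the previous row of the same user and product
                       -- is at most 2 seconds earlier; 1 otherwise, including
                       -- the first row of a partition, where lag() is NULL
                       CASE WHEN "timestamp" - lag("timestamp") OVER
                                   (PARTITION BY userid, product
                                    ORDER BY "timestamp") <= interval '2 seconds'
                            THEN 0 ELSE 1 END AS is_start
                  FROM t) AS flagged) AS numbered
 GROUP BY userid, product, run_nr
 ORDER BY userid, "begin";
```

On the sample data from the question this should produce the same four ranges as the desired output. Grouping directly also avoids duplicate result rows if a run ever contains two rows with the same first timestamp.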

    Note that timestamp, begin and end are reserved words in SQL (end is also reserved in PostgreSQL), so you should either avoid them or double-quote them.
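
One way to follow that advice is to rename the column and choose output aliases that are not keywords. The names ts, period_begin and period_end below are only illustrative assumptions:

```sql
-- Sketch: rename the keyword-named column once, so later queries
-- no longer need double quotes around it. ts is a hypothetical name.
ALTER TABLE t RENAME COLUMN "timestamp" TO ts;

-- Aliases that are not SQL keywords need no quoting either.
SELECT userid,
       min(ts) AS period_begin,
       max(ts) AS period_end,
       product
  FROM t
 GROUP BY userid, product;
```

The second statement only demonstrates the naming. It groups by userid and product alone, so it does not split runs on the 2-second delta.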