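I have a table that represents product usage, somewhat like a log. Product usage is logged as multiple timestamps, and I want to represent the same data using time ranges.
It looks like this (PostgreSQL 9.1):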
userid | timestamp | product
-------------------------------------
001 | 2012-04-23 9:12:05 | foo
001 | 2012-04-23 9:12:07 | foo
001 | 2012-04-23 9:12:09 | foo
001 | 2012-04-23 9:12:11 | barbaz
001 | 2012-04-23 9:12:13 | barbaz
001 | 2012-04-23 9:15:00 | barbaz
001 | 2012-04-23 9:15:01 | barbaz
002 | 2012-04-24 3:41:01 | foo
002 | 2012-04-24 3:41:03 | foo
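I want to collapse rows whose time difference from the previous row of the same run is no more than a delta (e.g. 2 seconds), and get the begin and end time of each run, like this: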
userid | begin | end | product
----------------------------------------------------------
001 | 2012-04-23 9:12:05 | 2012-04-23 9:12:09 | foo
001 | 2012-04-23 9:12:11 | 2012-04-23 9:12:13 | barbaz
001 | 2012-04-23 9:15:00 | 2012-04-23 9:15:01 | barbaz
002 | 2012-04-24 3:41:01 | 2012-04-24 3:41:03 | foo
Note that if usages of the same product are more than delta apart (2 seconds in this example), they are split into two rows.
create table t (userid int, timestamp timestamp, product text);
insert into t (userid, timestamp, product) values
(001, '2012-04-23 9:12:05', 'foo'),
(001, '2012-04-23 9:12:07', 'foo'),
(001, '2012-04-23 9:12:09', 'foo'),
(001, '2012-04-23 9:12:11', 'barbaz'),
(001, '2012-04-23 9:12:13', 'barbaz'),
(001, '2012-04-23 9:15:00', 'barbaz'),
(001, '2012-04-23 9:15:01', 'barbaz'),
(002, '2012-04-24 3:41:01', 'foo'),
(002, '2012-04-24 3:41:03', 'foo')
;
Answer 0 (score: 9)
Inspired by this answer from @a_horse_with_no_name.
WITH groupped_t AS (
    SELECT *, sum(grp_id) OVER (ORDER BY userid, product, "timestamp") AS grp_nr
    FROM (SELECT t.*,
                 lag("timestamp") OVER
                     (PARTITION BY userid, product ORDER BY "timestamp") AS prev_ts,
                 CASE WHEN ("timestamp" - lag("timestamp") OVER
                     (PARTITION BY userid, product ORDER BY "timestamp")) <= '2s'::interval
                 THEN NULL ELSE 1 END AS grp_id
          FROM t) AS g
), periods AS (
    SELECT min(gt."timestamp") AS grp_min, max(gt."timestamp") AS grp_max, grp_nr
    FROM groupped_t AS gt
    GROUP BY gt.grp_nr
)
SELECT gt.userid, p.grp_min AS "begin", p.grp_max AS "end", gt.product
FROM periods p
JOIN groupped_t gt ON gt.grp_nr = p.grp_nr AND gt."timestamp" = p.grp_min
ORDER BY gt.userid, p.grp_min;
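I use userid, product, and the time difference to assign group IDs: the CASE expression yields 1 on the first row of each run and NULL otherwise. I think PARTITION BY on the first two fields should be safe here.
groupped_t gives me all the source columns plus an extra running group number. I use only ORDER BY (no PARTITION BY) in the sum() window function because I need the group IDs to be unique across the whole result.
periods is just a helper query for the first and last timestamps in each group.
Finally, I join groupped_t with periods on grp_nr (this is why I need it to be unique) and on the timestamp of the first entry in each group. You can also check this query on SQL Fiddle.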
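For readers who find the window-function steps easier to follow procedurally, here is a minimal Python sketch of the same flag-and-running-sum idea. It is not part of the original answer: the function name collapse, the ROWS list, and the tuple layout (userid, timestamp, product) are assumptions made for illustration, and the data is copied from the question's sample.

```python
from datetime import datetime, timedelta
from itertools import accumulate, groupby

# Sample rows from the question, as (userid, timestamp, product) tuples.
# The tuple layout is an assumption for this sketch.
ROWS = [
    (1, datetime(2012, 4, 23, 9, 12, 5), "foo"),
    (1, datetime(2012, 4, 23, 9, 12, 7), "foo"),
    (1, datetime(2012, 4, 23, 9, 12, 9), "foo"),
    (1, datetime(2012, 4, 23, 9, 12, 11), "barbaz"),
    (1, datetime(2012, 4, 23, 9, 12, 13), "barbaz"),
    (1, datetime(2012, 4, 23, 9, 15, 0), "barbaz"),
    (1, datetime(2012, 4, 23, 9, 15, 1), "barbaz"),
    (2, datetime(2012, 4, 24, 3, 41, 1), "foo"),
    (2, datetime(2012, 4, 24, 3, 41, 3), "foo"),
]


def collapse(rows, delta=timedelta(seconds=2)):
    """Collapse rows into (userid, begin, end, product) runs."""
    # Same ordering as ORDER BY userid, product, "timestamp".
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[1]))

    # Step 1 (lag + CASE): flag 1 where a new run starts, 0 where the row
    # continues the previous run. The SQL uses NULL instead of 0, which
    # sum() ignores, so the effect is the same.
    flags = []
    prev = None
    for userid, ts, product in ordered:
        same_partition = (
            prev is not None and prev[0] == userid and prev[2] == product
        )
        continues_run = same_partition and ts - prev[1] <= delta
        flags.append(0 if continues_run else 1)
        prev = (userid, ts, product)

    # Step 2 (running sum): the cumulative sum of the flags is a group
    # number that is unique across all users and products.
    group_numbers = list(accumulate(flags))

    # Step 3 (GROUP BY grp_nr): first and last timestamp of each group.
    periods = []
    numbered = zip(group_numbers, ordered)
    for _nr, pairs in groupby(numbered, key=lambda pair: pair[0]):
        members = [row for _group, row in pairs]
        userid, begin, product = members[0]
        end = members[-1][1]
        periods.append((userid, begin, end, product))

    # Same as the final ORDER BY userid, begin.
    return sorted(periods, key=lambda p: (p[0], p[1]))


if __name__ == "__main__":
    for userid, begin, end, product in collapse(ROWS):
        print(userid, begin, end, product)
```

On the question's sample data this yields the four ranges requested in the question. Because each group's rows are already sorted by timestamp, the first and last members stand in for min() and max().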
Note that timestamp, begin, and end are reserved words in SQL (end is also reserved in PostgreSQL), so you should either avoid them or double-quote them.