Question

I have about 1 million events in a PostgreSQL database in this format:

id        |   stream_id     |  timestamp
----------+-----------------+-----------------
1         |   7             |  ....
2         |   8             |  ....

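There are about 50,000 unique streams.

I need to find all events where the time between any two consecutive events exceeds a certain period. In other words, I need to find pairs of events with no other event between them for a certain period of time.

For example: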

a b c d   e     f              g         h   i  j k
| | | |   |     |              |         |   |  | | 

                \____2 mins____/

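In this case, I want to find the pair (f, g), since those are the events immediately surrounding the gap.

I don't care if the query is slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so if it is slow, hopefully it at least scales sanely.

I also have the data in MongoDB.

What is the best way to perform this query?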

Answer 1

In Postgres this can be done easily with the lag() window function. See the fiddle below as an example:

SQL Fiddle

PostgreSQL 9.3 Schema Setup:

CREATE TABLE Table1
    ("id" int, "stream_id" int, "timestamp" timestamp)
;

INSERT INTO Table1
    ("id", "stream_id", "timestamp")
VALUES
    (1, 7, '2015-06-01 15:20:30'),
    (2, 7, '2015-06-01 15:20:31'),
    (3, 7, '2015-06-01 15:20:32'),
    (4, 7, '2015-06-01 15:25:30'),
    (5, 7, '2015-06-01 15:25:31')
;

Query 1:

with c as (select *,
           lag("timestamp") over(partition by stream_id order by id) as pre_time,
           lag(id) over(partition by stream_id order by id) as pre_id
           from Table1
          )
select * from c where "timestamp" - pre_time > interval '2 sec'

Results:

| id | stream_id |              timestamp |               pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
|  4 |         7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 |      3 |
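
To make the logic of the query easier to follow, here is a minimal Python sketch of the same idea outside the database. It is an illustration only, not part of the original answer: the function name `find_gaps` and the tuple layout are assumptions, and the rows are copied from the fiddle's sample data.

```python
from datetime import datetime, timedelta

# Sample rows copied from the fiddle: (id, stream_id, timestamp).
rows = [
    (1, 7, datetime(2015, 6, 1, 15, 20, 30)),
    (2, 7, datetime(2015, 6, 1, 15, 20, 31)),
    (3, 7, datetime(2015, 6, 1, 15, 20, 32)),
    (4, 7, datetime(2015, 6, 1, 15, 25, 30)),
    (5, 7, datetime(2015, 6, 1, 15, 25, 31)),
]


def find_gaps(rows, threshold):
    """Hypothetical helper: return (pre_id, id, stream_id, gap) for consecutive
    events of the same stream whose timestamps differ by more than threshold."""
    last_seen = {}  # stream_id -> (id, timestamp) of the previous row in that stream
    gaps = []
    for event_id, stream_id, ts in sorted(rows):  # ordered by id, like the query
        previous = last_seen.get(stream_id)
        if previous is not None and ts - previous[1] > threshold:
            gaps.append((previous[0], event_id, stream_id, ts - previous[1]))
        last_seen[stream_id] = (event_id, ts)
    return gaps


gaps = find_gaps(rows, timedelta(seconds=2))
print(gaps)
```

With the fiddle's data and a 2-second threshold, this reports the pair of ids (3, 4), which matches the single result row above.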

Answer 2

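You can do this with the lag() window function over a partition by stream_id that is ordered by the timestamp. The lag() function gives you access to previous rows in the partition; without an offset argument, it is the row immediately before the current one. So if the partition on stream_id is ordered by time, then the previous row is the previous event for that stream_id.

-- Window functions and output aliases cannot be referenced in WHERE,
-- so the gap is computed in a subquery and filtered in the outer query.
SELECT stream_id, start_id, end_id, diff
FROM (
    SELECT stream_id,
           lag(id) OVER pair AS start_id,
           id AS end_id,
           ("timestamp" - lag("timestamp") OVER pair) AS diff
    FROM my_table
    WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps
WHERE diff > interval '2 minutes';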
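
As a rough illustration of what lag() returns for each row of one ordered partition, here is a small Python sketch. The helper `lag` and the sample ids are hypothetical; it only mimics the offset and default arguments for non-negative offsets and is not how PostgreSQL implements the function.

```python
def lag(values, offset=1, default=None):
    """Mimic lag() over one already-ordered partition: for each position,
    return the value `offset` rows earlier, or `default` if there is none.
    Assumes offset >= 0."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


ids = [10, 11, 12, 13]  # hypothetical event ids of one stream, ordered by time

previous_ids = lag(ids)         # offset defaults to 1: the row just before
two_rows_back = lag(ids, 2)     # explicit offset of 2
print(previous_ids)
print(two_rows_back)
```

The first row of a partition has no predecessor, so it gets the default (NULL in SQL, `None` here). That is why the first event of each stream never appears as the end of a gap.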

Finding gaps in huge event streams?
