Compare each row to all other rows

Time: 2018-01-27 05:37:32

Tags: postgresql amazon-redshift

Suppose I have a table like this:

CREATE TABLE events (
        event_id INTEGER,
        begin_date DATE,
        end_date DATE,
        PRIMARY KEY (event_id));

With data like this:

INSERT INTO events SELECT 1 AS event_id,'2017-01-01'::DATE AS begin_date, '2017-01-07'::DATE AS end_date;
INSERT INTO events SELECT 2 AS event_id,'2017-01-04'::DATE AS begin_date, '2017-01-05'::DATE AS end_date;
INSERT INTO events SELECT 3 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-03'::DATE AS end_date;
INSERT INTO events SELECT 4 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-08'::DATE AS end_date;
INSERT INTO events SELECT 5 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;
INSERT INTO events SELECT 6 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-06'::DATE AS end_date;
INSERT INTO events SELECT 7 AS event_id,'2017-01-08'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;

I would like to be able to do this:

SELECT a.event_id, 
       COUNT (*) AS COUNT
  FROM events AS a
  LEFT JOIN events AS b
       ON a.begin_date < b.begin_date
          AND a.end_date > b.end_date
 GROUP BY a.event_id
 ORDER BY a.event_id ASC

With results like this:

*----------*-------*
| event_id | count |
*----------*-------*
|    1     |   3   |
|    2     |   1   |
|    3     |   1   |
|    4     |   1   |
|    5     |   3   |
|    6     |   1   |
|    7     |   1   |
*----------*-------*

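But I would like to do it with a window function, because that is much faster than an inequality join. Something like this, where I can compare the outer row to the inner rows: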

SELECT a.event_id, 
       COUNT(*) OVER (a.begin_date < b.begin_date AND a.end_date > b.end_date) AS count
  FROM events AS a
 ORDER BY a.event_id ASC

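Ideally, this would work on both Postgres and Redshift.

1 Answer:

Answer 0: (score: 1)

I don't think you can avoid a JOIN here. Even the window function idea would require an implicit join of sorts, because the window being compared would have to be the set of all other records.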

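Instead, consider using Postgres' date range type. You can convert the begin and end dates to ranges and check whether the exclusive range from the left table contains the inclusive range from the right table: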

SELECT t1.event_id, count(*) 
FROM events t1 
   LEFT OUTER JOIN events t2 
       ON daterange(t1.begin_date, t1.end_date, '()') @> daterange(t2.begin_date, t2.end_date, '[]') AND t1.event_id <> t2.event_id 
GROUP BY 1 
ORDER BY 1;


 event_id | count
----------+-------
        1 |     3
        2 |     1
        3 |     1
        4 |     1
        5 |     3
        6 |     1
        7 |     1
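
To see why the exclusive/inclusive pairing reproduces the strict inequalities, it can help to look at how Postgres normalizes date ranges. The sketch below uses literal dates taken from events 1, 2, 3 and 5 of the sample data; the column aliases are made up for illustration.

```sql
-- Postgres stores discrete date ranges in the half-open form [lower, upper),
-- so an exclusive lower bound moves forward one day and an inclusive upper
-- bound moves forward one day.
SELECT daterange('2017-01-01'::date, '2017-01-07'::date, '()') AS outer_exclusive,  -- [2017-01-02,2017-01-07)
       daterange('2017-01-04'::date, '2017-01-05'::date, '[]') AS inner_inclusive;  -- [2017-01-04,2017-01-06)

-- Event 1 vs event 2: the inner event starts later and ends earlier, so it is contained.
SELECT daterange('2017-01-01'::date, '2017-01-07'::date, '()')
       @> daterange('2017-01-04'::date, '2017-01-05'::date, '[]') AS contained;     -- true

-- Event 5 vs event 3: both begin on 2017-01-02, so the strict test fails.
SELECT daterange('2017-01-02'::date, '2017-01-09'::date, '()')
       @> daterange('2017-01-02'::date, '2017-01-03'::date, '[]') AS contained;     -- false
```

An event that shares a begin or end date with the outer event falls outside the exclusive range, which is the same outcome as the `<` and `>` comparisons in the original query.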

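The real question (and I don't know the answer) is whether all this casting to ranges and the "range contains" @> logic is any more efficient than your inequality version. Although the Nested Loop Left Join used here has a lower estimated row count, its Total Cost comes out about 33% higher.

If your data were stored as a date range type to begin with, I have a hunch the cost would come down, since the conversion would no longer need to happen (although, because we are comparing an inclusive range against an exclusive one, it might be a wash).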
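
One way to check this on your own data is to run both versions under EXPLAIN and compare the plans. This is a sketch in Postgres syntax; the (ANALYZE, BUFFERS) options execute the query and report actual timings, and Redshift's EXPLAIN does not accept them, so plain EXPLAIN would be needed there.

```sql
-- Inequality version from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.event_id, COUNT(*) AS count
  FROM events AS a
  LEFT JOIN events AS b
         ON a.begin_date < b.begin_date
        AND a.end_date > b.end_date
 GROUP BY a.event_id;

-- Range-containment version from this answer.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.event_id, COUNT(*)
  FROM events t1
  LEFT JOIN events t2
         ON daterange(t1.begin_date, t1.end_date, '()')
            @> daterange(t2.begin_date, t2.end_date, '[]')
        AND t1.event_id <> t2.event_id
 GROUP BY t1.event_id;
```

With only seven rows the numbers will not mean much, so a comparison on a realistically sized table is more informative than the cost estimates alone.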
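
A minimal sketch of what that could look like, assuming a hypothetical table events_ranged with a period column (neither is part of the question) and a GiST index on it. Because the stored range is inclusive, the strict comparison is recovered here with extra checks on the bounds; this is one possible approach, not a measured improvement. Range types are a Postgres feature and, as far as I know, are not available on Redshift.

```sql
-- Hypothetical table: same events, with the period stored as an inclusive daterange.
CREATE TABLE events_ranged (
    event_id INTEGER PRIMARY KEY,
    period   DATERANGE NOT NULL
);

INSERT INTO events_ranged (event_id, period)
SELECT event_id, daterange(begin_date, end_date, '[]')
  FROM events;

-- A GiST index lets the planner consider an index scan for @> and <@.
CREATE INDEX events_ranged_period_idx ON events_ranged USING gist (period);

-- Containment plus strict bound checks is equivalent to
-- a.begin_date < b.begin_date AND a.end_date > b.end_date.
SELECT t1.event_id, COUNT(*)
  FROM events_ranged t1
  LEFT JOIN events_ranged t2
         ON t1.period @> t2.period
        AND lower(t1.period) < lower(t2.period)
        AND upper(t1.period) > upper(t2.period)
 GROUP BY t1.event_id
 ORDER BY t1.event_id;
```

The strict bound checks also exclude a row from matching itself, so the separate event_id <> event_id condition is not needed in this form.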