Self-join on a large dataset to find overlapping time intervals

Time: 2019-02-13 16:41:53

Tags: postgresql self-join

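I have a table with 14 million rows. I am trying to build a kind of "edge list" that records, for each observation, which other observations have a scan date within a delta of its own scan date. Running the query on a 1% sample of the data takes an hour; a 5% sample took more than 12 hours. A minimal example of the task is: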

Original table

+----------+------------+------------+------------+
|    ID    |  SCANDATE  | SCANDATE+D | SCANDATE-D |
+----------+------------+------------+------------+
| A        | 2018/08/03 | 2018/08/05 | 2018/08/01 |
| B        | 2018/08/04 | 2018/08/06 | 2018/08/02 |
| C        | 2018/08/11 | 2018/08/13 | 2018/08/09 |
+----------+------------+------------+------------+

Result:

+----------+------------+
|    ID1   | ID2        |
+----------+------------+
|   A      |      B     | 
+----------+------------+
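The minimal example can be reproduced on a small table. The sketch below makes some assumptions that are not part of the real schema: a demo table named `bag_list_demo`, a delta of two days (matching the SCANDATE+D and SCANDATE-D columns), and an `a.id < b.id` condition so that each pair is listed only once, as in the single A/B row of the result.

```sql
-- Hypothetical demo table; the name and column types are illustrative.
CREATE TEMP TABLE bag_list_demo (
    id        text PRIMARY KEY,
    scan_date date NOT NULL
);

INSERT INTO bag_list_demo (id, scan_date) VALUES
    ('A', DATE '2018-08-03'),
    ('B', DATE '2018-08-04'),
    ('C', DATE '2018-08-11');

-- Pair every row with the rows scanned strictly within two days of it.
-- In PostgreSQL, date - integer and date + integer shift a date by days.
SELECT a.id AS id1,
       b.id AS id2
FROM bag_list_demo AS a
JOIN bag_list_demo AS b
  ON a.id < b.id                      -- keep each unordered pair once
 AND b.scan_date > a.scan_date - 2    -- after SCANDATE-D
 AND b.scan_date < a.scan_date + 2;   -- before SCANDATE+D
```

With `<>` instead of `<` on the ID, as in the full query, the same pair would come back twice, once as A/B and once as B/A.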

The code I am using is this:

    CREATE TABLE edgelist_1 AS (
        SELECT *
        FROM (
            SELECT scan_date,
                   ship_date,
                   serial_number,
                   scan_date + interval '3' day AS buffup,
                   scan_date - interval '3' day AS bufflow
            FROM bag_list_1
        ) AS ims
        INNER JOIN (
            SELECT scan_date AS buff,
                   serial_number AS SN
            FROM bag_list_1
        ) AS X
            ON ims.serial_number <> X.SN
        WHERE ims.bufflow < X.buff
          AND ims.buffup > X.buff
    );

This is the EXPLAIN output I get:

 Gather  (cost=1000.00..4596699470013.62 rows=21994769623450 width=62)
   Workers Planned: 4
   ->  Nested Loop  (cost=0.00..2397222506668.62 rows=5498692405862 width=62)
         Join Filter: ((bag_list_1.serial_number <> bag_list_1_1.serial_number) AND ((bag_list_1.scan_date - '3 days'::interval day) < bag_list_1_1.scan_date) AND ((bag_list_1.scan_date + '3 days'::interval day) > bag_list_1_1.scan_date))
         ->  Parallel Seq Scan on bag_list_1  (cost=0.00..251629.94 rows=3517394 width=27)
         ->  Seq Scan on bag_list_1 bag_list_1_1  (cost=0.00..357151.75 rows=14069575 width=19)
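The plan is a nested loop over two sequential scans of `bag_list_1`, with all three conditions applied as a join filter, so every pair of rows is compared. One direction that might help, offered as an untested sketch rather than a measured fix, is to index `scan_date` and keep the range conditions on the bare column, so the planner has the option of an index range scan on the inner side. The index name and the output column aliases below are illustrative, and the sketch assumes `scan_date` is a date or timestamp column.

```sql
-- Assumed index; the name is illustrative. serial_number is included so the
-- inner side could in principle be answered from the index alone.
CREATE INDEX bag_list_1_scan_date_idx
    ON bag_list_1 (scan_date, serial_number);

-- Refresh planner statistics after creating the index.
ANALYZE bag_list_1;

-- Same strict three-day window as the original query, written as a range
-- condition directly on x.scan_date.
CREATE TABLE edgelist_1 AS
SELECT ims.serial_number AS sn1,
       x.serial_number   AS sn2,
       ims.scan_date     AS scan_date1,
       x.scan_date       AS scan_date2
FROM bag_list_1 AS ims
JOIN bag_list_1 AS x
  ON x.scan_date > ims.scan_date - interval '3 days'
 AND x.scan_date < ims.scan_date + interval '3 days'
 AND x.serial_number <> ims.serial_number;
```

Whether the planner actually picks the index depends on the data. If many rows fall within three days of each other, the edge list itself is very large and no index avoids the cost of producing it. Running EXPLAIN on the SELECT before creating the table would show whether the plan changed.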

What would be a more efficient way to accomplish this task?

0 answers:

No answers yet.