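I have a table with 14 million rows. I am trying to build a kind of "edge list": for each observation, I want to find every other observation whose scan date falls within a delta of time of it. Running the query on a 1% sample of the data takes an hour; a 5% sample took more than 12 hours. A minimal example of the task:
Original table: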
+----------+------------+------------+------------+
| ID | SCANDATE | SCANDATE+D | SCANDATE-D |
+----------+------------+------------+------------+
| A | 2018/08/03 | 2018/08/05 | 2018/08/01 |
| B | 2018/08/04 | 2018/08/06 | 2018/08/02 |
| C | 2018/08/11 | 2018/08/13 | 2018/08/09 |
+----------+------------+------------+------------+
Result:
+----------+------------+
| ID1 | ID2 |
+----------+------------+
| A | B |
+----------+------------+
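To make the minimal example concrete, here is a small sketch that should reproduce it in PostgreSQL. The table name scans, its columns id and scandate, and the 2-day delta are assumptions taken from the example tables above, not from my real schema.

```sql
-- Hypothetical table and column names, mirroring the minimal example only.
CREATE TEMP TABLE scans (id text, scandate date);

INSERT INTO scans (id, scandate) VALUES
    ('A', DATE '2018-08-03'),
    ('B', DATE '2018-08-04'),
    ('C', DATE '2018-08-11');

-- Pair each scan with every other scan whose date falls strictly inside
-- its +/- 2 day window. The a.id < b.id condition keeps one row per
-- unordered pair instead of both (A, B) and (B, A).
SELECT a.id AS id1, b.id AS id2
FROM scans AS a
JOIN scans AS b
  ON a.id < b.id
 AND b.scandate > a.scandate - interval '2' day
 AND b.scandate < a.scandate + interval '2' day;
```

On the three sample rows this should return the single pair (A, B). My real query uses <> instead of <, so it returns each pair in both directions.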
The code I am using is:
CREATE TABLE edgelist_1 AS (
    SELECT *
    FROM (
        SELECT scan_date,
               ship_date,
               serial_number,
               scan_date + interval '3' day AS buffup,
               scan_date - interval '3' day AS bufflow
        FROM bag_list_1
    ) AS ims
    INNER JOIN (
        SELECT scan_date AS buff,
               serial_number AS SN
        FROM bag_list_1
    ) AS X
        ON ims.serial_number <> X.SN
    WHERE ims.bufflow < X.buff
      AND ims.buffup > X.buff
)
This is the EXPLAIN output I get:
Gather  (cost=1000.00..4596699470013.62 rows=21994769623450 width=62)
  Workers Planned: 4
  ->  Nested Loop  (cost=0.00..2397222506668.62 rows=5498692405862 width=62)
        Join Filter: ((bag_list_1.serial_number <> bag_list_1_1.serial_number) AND ((bag_list_1.scan_date - '3 days'::interval day) < bag_list_1_1.scan_date) AND ((bag_list_1.scan_date + '3 days'::interval day) > bag_list_1_1.scan_date))
        ->  Parallel Seq Scan on bag_list_1  (cost=0.00..251629.94 rows=3517394 width=27)
        ->  Seq Scan on bag_list_1 bag_list_1_1  (cost=0.00..357151.75 rows=14069575 width=19)
What would be a more efficient way to accomplish this task?