Question

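I'm trying to compare two queries to understand which one would be faster. The basic idea is to have one table without duplicates (say test1). Then you insert only the delta from a second table (say test2), and if the second table has duplicates, you insert only one copy of each record.

Setup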

create table test1  (id varchar(10), a bigint, b bigint);
create table test2  (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);

Query 1:

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  NOT EXISTS (SELECT *
                   FROM   test2
                   WHERE  test2.id = test1.id);

Query 2:

INSERT INTO test2
    SELECT id,
           a,
           b
    FROM   (SELECT t2.*
            FROM   (SELECT Row_number() OVER(partition BY id) AS dup_id,
                           *
                    FROM   test1) t2
            WHERE  t2.dup_id = 1) t1
    WHERE  t1.id NOT IN (SELECT test2.id
                         FROM   test2);

Can someone help me understand which one would be faster and more efficient?

Update

EXPLAIN for the first query:

db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan "*SELECT*"  (cost=3613333.97..4213334.30 rows=7 width=49)
   ->  XN Unique  (cost=3613333.97..4213334.23 rows=7 width=49)
         ->  XN Hash Left Join DS_BCAST_INNER  (cost=3613333.97..4213334.18 rows=7 width=49)
               Hash Cond: ("outer".oid = "inner".oid)
               Filter: ("inner".oid IS NULL)
               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=53)
               ->  XN Hash  (cost=3613333.96..3613333.96 rows=5 width=4)
                     ->  XN Subquery Scan volt_dt_1  (cost=1760000.36..3613333.96 rows=5 width=4)
                           ->  XN Unique  (cost=1760000.36..3613333.91 rows=5 width=4)
                                 ->  XN Hash Join DS_DIST_BOTH  (cost=1760000.36..3613333.90 rows=5 width=4)
                                       Outer Dist Key: test1.id
                                       Inner Dist Key: volt_dt_2.id
                                       Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                       ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=37)
                                       ->  XN Hash  (cost=1760000.34..1760000.34 rows=5 width=33)
                                             ->  XN Subquery Scan volt_dt_2  (cost=1760000.29..1760000.34 rows=5 width=33)
                                                   ->  XN HashAggregate  (cost=1760000.29..1760000.29 rows=5 width=33)
                                                         ->  XN Hash Join DS_DIST_BOTH  (cost=0.06..1760000.27 rows=5 width=33)
                                                               Outer Dist Key: test1.id
                                                               Inner Dist Key: test2.id
                                                               Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                                               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=33)
                                                               ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
                                                                     ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)

EXPLAIN for the second query:

db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2); 
                                                                                     QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Hash NOT IN Join DS_DIST_INNER  (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
   Inner Dist Key: db.test2.id
   Hash Cond: (("outer".id)::text = ("inner".id)::text)
   ->  XN Subquery Scan t2  (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
         Filter: (dup_id = 1)
         ->  XN Window  (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
               Partition: id
               Order: id
               ->  XN Sort  (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
                     Sort Key: id
                     ->  XN Network  (cost=0.00..0.07 rows=7 width=49)
                           Distribute
                           ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=49)
   ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
         ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)

Answer 1

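I would think that the first should be faster, although it would require an index on test2(id).

Normally, the answer to such questions is "try it on your data and on your system... and let us know". However, row_number() requires a full scan of test1. You might as well do an index lookup in test2 at the same time, and that is what the first version does.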
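One way to act on the "try it on your data" advice: both plans in the question end with a notice that test1 and test2 are missing statistics, so the cost figures are estimates built on defaults. A minimal sketch, assuming the same two tables and that you are free to run ANALYZE on them, is to refresh the statistics and then look at the plan again:

```sql
-- Refresh planner statistics, as the plan output itself recommends.
ANALYZE test1;
ANALYZE test2;

-- Re-check the plan for the first query with fresh statistics.
-- EXPLAIN only shows the plan; it does not run the INSERT.
EXPLAIN
INSERT INTO test2
SELECT DISTINCT id, a, b
FROM   test1
WHERE  NOT EXISTS (SELECT 1
                   FROM   test2
                   WHERE  test2.id = test1.id);
```

Plan costs remain estimates even with statistics in place, so timing both statements on realistic data volumes is still the deciding test.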

Answer 2

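As everyone knows, INSERT is always expensive compared with SELECT, UPDATE, and DELETE, because INSERT has no WHERE clause, and it gets more expensive the more indexes the table has.

Both queries still perform an INSERT, though, so the point above does not help choose between them.

In the first query, DISTINCT definitely requires a table scan. In the WHERE clause, instead of NOT EXISTS, you could use the NOT IN operator to check the id: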

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  id NOT IN (SELECT id FROM test2);
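One caveat with this rewrite, which follows from standard SQL three-valued logic: NOT IN and NOT EXISTS are only interchangeable when the subquery cannot return NULL. If test2.id ever contains a NULL, the NOT IN predicate never evaluates to true and nothing is inserted. A sketch of a NULL-safe form, assuming id may be NULL in test2 (the table definition in the question does not declare it NOT NULL):

```sql
INSERT INTO test2
SELECT DISTINCT id, a, b
FROM   test1
WHERE  id NOT IN (SELECT id
                  FROM   test2
                  WHERE  id IS NOT NULL);  -- keep NULLs out of the NOT IN list
```

If id is guaranteed to be non-NULL in both tables, the extra filter is unnecessary.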

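As everyone knows, SELECT * FROM table is expensive; it is better to select only the columns you need.

In your second query, as Gordon said, ROW_NUMBER also requires a table scan, and that query uses several derived tables, so I would guess the first query performs better than the second.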
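Before comparing speed, it may be worth confirming that the two queries insert the same rows. Query 1 removes duplicates across (id, a, b), while Query 2 keeps one row per id, so they differ whenever an id appears in test1 with more than one (a, b) combination. A quick check, as a sketch against the test1 table from the question:

```sql
-- List ids that occur with more than one distinct (a, b) pair.
-- An empty result means DISTINCT and ROW_NUMBER() ... = 1 keep the same rows.
SELECT id,
       COUNT(*) AS variants
FROM   (SELECT DISTINCT id, a, b
        FROM   test1) d
GROUP  BY id
HAVING COUNT(*) > 1;
```

With the sample data from the question this returns no rows, so there the two queries are equivalent and the choice comes down to performance.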
