Which of these two queries performs better?

Time: 2016-07-08 20:42:04

Tags: sql performance amazon-redshift

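I'm trying to compare two queries to understand which one will be faster. The basic idea is that one table holds no duplicates (test2 in the setup here). You then insert into it only the delta from a second table (test1 here), and if that second table contains duplicates, only one copy of each record is inserted.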

Setup

create table test1  (id varchar(10), a bigint, b bigint);
create table test2  (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);

Query 1:

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  NOT EXISTS (SELECT *
                   FROM   test2
                   WHERE  test2.id = test1.id);

Query 2:

INSERT INTO test2
    SELECT id,
           a,
           b
    FROM   (SELECT t2.*
            FROM   (SELECT Row_number() OVER(partition BY id) AS dup_id,
                           *
                    FROM   test1) t2
            WHERE  t2.dup_id = 1) t1
    WHERE  t1.id NOT IN (SELECT test2.id
                         FROM   test2);

Can someone help me understand which one will be faster and more efficient?

Update

EXPLAIN for the first query:
db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan "*SELECT*"  (cost=3613333.97..4213334.30 rows=7 width=49)
   ->  XN Unique  (cost=3613333.97..4213334.23 rows=7 width=49)
         ->  XN Hash Left Join DS_BCAST_INNER  (cost=3613333.97..4213334.18 rows=7 width=49)
               Hash Cond: ("outer".oid = "inner".oid)
               Filter: ("inner".oid IS NULL)
               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=53)
               ->  XN Hash  (cost=3613333.96..3613333.96 rows=5 width=4)
                     ->  XN Subquery Scan volt_dt_1  (cost=1760000.36..3613333.96 rows=5 width=4)
                           ->  XN Unique  (cost=1760000.36..3613333.91 rows=5 width=4)
                                 ->  XN Hash Join DS_DIST_BOTH  (cost=1760000.36..3613333.90 rows=5 width=4)
                                       Outer Dist Key: test1.id
                                       Inner Dist Key: volt_dt_2.id
                                       Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                       ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=37)
                                       ->  XN Hash  (cost=1760000.34..1760000.34 rows=5 width=33)
                                             ->  XN Subquery Scan volt_dt_2  (cost=1760000.29..1760000.34 rows=5 width=33)
                                                   ->  XN HashAggregate  (cost=1760000.29..1760000.29 rows=5 width=33)
                                                         ->  XN Hash Join DS_DIST_BOTH  (cost=0.06..1760000.27 rows=5 width=33)
                                                               Outer Dist Key: test1.id
                                                               Inner Dist Key: test2.id
                                                               Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                                               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=33)
                                                               ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
                                                                     ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)
EXPLAIN for the second query:
db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2); 
                                                                                     QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Hash NOT IN Join DS_DIST_INNER  (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
   Inner Dist Key: db.test2.id
   Hash Cond: (("outer".id)::text = ("inner".id)::text)
   ->  XN Subquery Scan t2  (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
         Filter: (dup_id = 1)
         ->  XN Window  (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
               Partition: id
               Order: id
               ->  XN Sort  (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
                     Sort Key: id
                     ->  XN Network  (cost=0.00..0.07 rows=7 width=49)
                           Distribute
                           ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=49)
   ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
         ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)
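
Both plans end with a notice that test1 and test2 are missing statistics, so the row and cost estimates shown may not reflect the actual data. A minimal sketch of refreshing statistics and then re-checking the plan, using only the tables and the first query from the question:

```sql
-- Collect table statistics, as the notice at the end of each plan suggests
ANALYZE test1;
ANALYZE test2;

-- Re-run EXPLAIN for query 1 once statistics are in place
EXPLAIN
INSERT INTO test2
SELECT DISTINCT id, a, b
FROM   test1
WHERE  NOT EXISTS (SELECT *
                   FROM   test2
                   WHERE  test2.id = test1.id);
```

The same EXPLAIN can be repeated for query 2 so that both plans are compared on equal footing.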

2 answers:

Answer 0 (score: 2)

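I think the first one should be faster, although it needs an index on test2(id).

Normally the answer to questions like this is "try it on your data and your system . . . and let us know". However, row_number() requires a full scan of test1. You might as well do the index lookup in test2 at the same time, which is what the first version does.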
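
Since the question is tagged amazon-redshift, note that Redshift does not support conventional indexes, so the closest counterpart to an index on test2(id) is a distribution key and sort key on that column. The following is only a sketch: the table names test1_keyed and test2_keyed are hypothetical, and whether the keys help on real data has to be measured.

```sql
-- Assumption: copies of the question's tables, distributed and sorted on id.
-- With matching DISTKEYs, rows sharing an id are stored on the same slice,
-- which may let the join avoid the DS_DIST_BOTH redistribution seen in the plan.
CREATE TABLE test1_keyed (id varchar(10), a bigint, b bigint)
DISTKEY (id) SORTKEY (id);

CREATE TABLE test2_keyed (id varchar(10), a bigint, b bigint)
DISTKEY (id) SORTKEY (id);

-- Copy the existing rows into the keyed tables
INSERT INTO test1_keyed SELECT id, a, b FROM test1;
INSERT INTO test2_keyed SELECT id, a, b FROM test2;
```

Running EXPLAIN on either query against the keyed tables shows whether the redistribution steps actually disappear.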

Answer 1 (score: 1)

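As we all know, INSERT is always expensive compared with SELECT, UPDATE, and DELETE, because INSERT has no WHERE clause, and it becomes more expensive still when the table has more indexes.

Both queries perform an INSERT, though, so that point does not separate them.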

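  1. In the first query, DISTINCT definitely requires a table scan, and in the WHERE clause you can check the ID with the NOT IN operator instead of NOT EXISTS:

    INSERT INTO test2
    SELECT DISTINCT id,
                    a,
                    b
    FROM   test1
    WHERE  id NOT IN (SELECT id FROM test2);

    As everyone knows, SELECT * FROM Table is expensive; it is better to select only the columns you need.

  2. In your second query, as Gordon said, ROW_NUMBER also requires a table scan, and you use several derived tables in that query, so I would guess the first query is better than the second.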
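
One behavioral difference is worth keeping in mind when swapping NOT EXISTS for NOT IN: if the subquery returns a NULL id, the NOT IN predicate evaluates to unknown for every non-matching row and nothing is selected, while NOT EXISTS is unaffected. A small sketch using hypothetical scratch tables src and dst, which are not part of the question:

```sql
-- Hypothetical scratch tables for illustration only
CREATE TABLE src (id varchar(10));
CREATE TABLE dst (id varchar(10));

INSERT INTO src VALUES ('aaa'), ('ccc');
INSERT INTO dst VALUES ('aaa'), (NULL);

-- NOT IN: comparing 'ccc' with NULL yields unknown, so no row qualifies
SELECT id
FROM   src
WHERE  id NOT IN (SELECT id FROM dst);

-- NOT EXISTS: the NULL row never matches, so 'ccc' is returned
SELECT id
FROM   src s
WHERE  NOT EXISTS (SELECT 1 FROM dst d WHERE d.id = s.id);
```

If test2.id can never be NULL, the two forms return the same rows and the choice comes down to the plan each one produces.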