Why, in Greenplum, does a partitioned table use a nested loop join while a non-partitioned table uses a hash join?

Asked: 2018-06-04 07:09:15

Tags: greenplum

I created two tables (A and B) with 100 columns each and identical DDL, except that B is partitioned:

CREATE TABLE A (
  id integer, ......, col integer,
  CONSTRAINT A_pkey PRIMARY KEY (id))
WITH (OIDS = FALSE)
TABLESPACE pg_default
DISTRIBUTED BY (id);

CREATE TABLE B (
  id integer, ......, col integer,
  CONSTRAINT B_pkey PRIMARY KEY (id))
WITH (OIDS = FALSE)
TABLESPACE pg_default
DISTRIBUTED BY (id)
PARTITION BY RANGE(id) 
  (START (1) END (2100000) EVERY (500000), 
   DEFAULT PARTITION extra 
  );

I then loaded the same data (2,000,000 rows) into both A and B and ran the following SQL against each table:

UPDATE A a SET col = c.col FROM C c WHERE c.id = a.id;
UPDATE B b SET col = c.col FROM C c WHERE c.id = b.id;

The update on A succeeded within a minute, but the one on B ran for a very long time and finally failed with a memory error:

ERROR:  Canceling query because of high VMEM usage.

So I compared the EXPLAIN output of the two statements and found that A used a hash join while B used a nested loop join.

Is there a reason the partitioned table ends up with a nested loop join? When storing millions of rows, should table partitioning be avoided in Greenplum?

1 Answer:

Answer 0 (score: 1)

You are doing several things that are not recommended, which may explain why you are seeing a nested loop join.

  1. In general, avoid UPDATE statements. The old version of each row stays on disk alongside the new version, so updating an entire table effectively doubles the physical size it occupies. (A sketch of a rebuild-instead-of-update pattern follows the plan output at the end of this answer.)
  2. I have never seen heap tables used for partitioned tables. In Greenplum you should mostly use append-only tables, especially for larger tables such as partitioned ones.
  3. You are partitioning by the distribution key. This is not recommended and provides no benefit at all. Do you plan to filter by a range of IDs? That would be unusual. If so, change the distribution key to something else. (A sketch of the kind of query that actually benefits from partitioning also follows at the end of this answer.)
  4. I believe Pivotal has disabled the ability to create a primary key on a partitioned table; at one time it was not allowed at all. I would discourage you from creating any primary keys, since they only take up space and the optimizer typically will not use them.
  5. After fixing these items, I could not reproduce your nested loop problem. I am also using version 5.0.0.

        drop table if exists a;
        drop table if exists b;
        drop table if exists c;
        CREATE TABLE A 
        (id integer, col integer, mydate timestamp)
        WITH (appendonly=true)
        DISTRIBUTED BY (id);
    
        CREATE TABLE B 
        (id integer, col integer, mydate timestamp)
        WITH (appendonly=true)
        DISTRIBUTED BY (id)
        PARTITION BY RANGE(mydate) 
          (START ('2015-01-01'::timestamp) END ('2018-12-31'::timestamp) EVERY ('1 month'::interval), 
           DEFAULT PARTITION extra 
          );
    
        create table c
        (id integer, col integer, mydate timestamp)
        distributed by (id);
    
        insert into a
        select i, i+10, '2015-01-01'::timestamp + '1 day'::interval*i
        from generate_series(0, 2000) as i
        where '2015-01-01'::timestamp + '1 day'::interval*i < '2019-01-01'::timestamp;
    
        insert into b
        select i, i+10, '2015-01-01'::timestamp + '1 day'::interval*i
        from generate_series(0, 2000) as i
        where '2015-01-01'::timestamp + '1 day'::interval*i < '2019-01-01'::timestamp;
    
        insert into c
        select i, i+10, '2015-01-01'::timestamp + '1 day'::interval*i
        from generate_series(0, 2000) as i
        where '2015-01-01'::timestamp + '1 day'::interval*i < '2019-01-01'::timestamp;
    
    
        explain UPDATE A a SET col = c.col from C c where c.id = a.id;
        /*
        "Update  (cost=0.00..862.13 rows=1 width=1)"
        "  ->  Result  (cost=0.00..862.00 rows=1 width=34)"
        "        ->  Split  (cost=0.00..862.00 rows=1 width=30)"
        "              ->  Hash Join  (cost=0.00..862.00 rows=1 width=30)"
        "                    Hash Cond: public.a.id = c.id"
        "                    ->  Table Scan on a  (cost=0.00..431.00 rows=1 width=26)"
        "                    ->  Hash  (cost=431.00..431.00 rows=1 width=8)"
        "                          ->  Table Scan on c  (cost=0.00..431.00 rows=1 width=8)"
        "Settings:  optimizer_join_arity_for_associativity_commutativity=18"
        "Optimizer status: PQO version 2.42.0"
        */
    
        explain UPDATE B b SET col = c.col from C c where c.id = b.id;
        /*
        "Update  (cost=0.00..862.13 rows=1 width=1)"
        "  ->  Result  (cost=0.00..862.00 rows=1 width=34)"
        "        ->  Split  (cost=0.00..862.00 rows=1 width=30)"
        "              ->  Hash Join  (cost=0.00..862.00 rows=1 width=30)"
        "                    Hash Cond: public.a.id = c.id"
        "                    ->  Table Scan on a  (cost=0.00..431.00 rows=1 width=26)"
        "                    ->  Hash  (cost=431.00..431.00 rows=1 width=8)"
        "                          ->  Table Scan on c  (cost=0.00..431.00 rows=1 width=8)"
        "Settings:  optimizer_join_arity_for_associativity_commutativity=18"
        "Optimizer status: PQO version 2.42.0"
    
        */
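
To expand on item 1: since an UPDATE leaves the old row versions on disk, a common way to rewrite a whole table in Greenplum is to build the new version with CREATE TABLE ... AS and then swap the names, rather than updating in place. The following is a minimal sketch of that idea, not a prescribed method; the names a_new and a_old are hypothetical, and it assumes id is unique in c:

    -- Rebuild instead of UPDATE: write the joined result to a new
    -- append-only table, so old row versions never accumulate.
    CREATE TABLE a_new
    WITH (appendonly=true)
    AS
    SELECT a.id,
           coalesce(c.col, a.col) AS col,  -- take c.col where a match exists
           a.mydate
    FROM a
    LEFT JOIN c ON c.id = a.id
    DISTRIBUTED BY (id);

    -- Swap the rebuilt table into place and drop the old one.
    ALTER TABLE a RENAME TO a_old;
    ALTER TABLE a_new RENAME TO a;
    DROP TABLE a_old;

If you do update in place instead, running VACUUM on the table afterwards at least makes the space held by the old row versions reusable.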
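
And on item 3: partitioning pays off when queries filter on the partition column, because the planner can then skip partitions whose range cannot contain matching rows. Below is a hedged example against the monthly mydate partitioning above; the exact plan output depends on the optimizer and version, but it should show only the relevant partition(s) of b being scanned rather than the whole table:

    -- A date-range filter on the partition column allows partition
    -- elimination: partitions outside January 2017 can be skipped.
    EXPLAIN
    SELECT id, col
    FROM b
    WHERE mydate >= '2017-01-01'::timestamp
      AND mydate <  '2017-02-01'::timestamp;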