Question

I have a query like the following:

select
   a.col1,
   a.col2,
   b.col3
from
   a 
   left join b on (a.id=b.id and b.attribute_id=3)
   left join c on (a.id=c.id and c.attribute_id=4)

Even with the distkey set to id, I get a DS_BCAST_INNER in the query plan and end up with very long query times for only 1 million rows.

Answer 1

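Setting id as the distribution key on all of the joined tables should co-locate the data and eliminate the need for a broadcast.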

create table a (id int distkey, attribute_id int, col1 varchar(10), col2 varchar(10));
create table b (id int distkey, attribute_id int, col3 varchar(10));
create table c (id int distkey, attribute_id int);

You should see an explain plan similar to this:

admin@dev=# explain select
       a.col1,
       a.col2,
       b.col3
    from
       a 
       left join b on (a.id=b.id and b.attribute_id=3)
       left join c on (a.id=c.id and c.attribute_id=4);
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     XN Hash Left Join DS_DIST_NONE  (cost=0.09..0.23 rows=3 width=99)
       Hash Cond: ("outer".id = "inner".id)
       ->  XN Hash Left Join DS_DIST_NONE  (cost=0.05..0.14 rows=3 width=103)
             Hash Cond: ("outer".id = "inner".id)
             ->  XN Seq Scan on a  (cost=0.00..0.03 rows=3 width=70)
             ->  XN Hash  (cost=0.04..0.04 rows=3 width=37)
                   ->  XN Seq Scan on b  (cost=0.00..0.04 rows=3 width=37)
                         Filter: (attribute_id = 3)
       ->  XN Hash  (cost=0.04..0.04 rows=1 width=4)
             ->  XN Seq Scan on c  (cost=0.00..0.04 rows=1 width=4)
                   Filter: (attribute_id = 4)
    (11 rows)

    Time: 123.315 ms
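
If your plan still shows DS_BCAST_INNER instead of DS_DIST_NONE, it may be worth confirming that the distribution key actually took effect on every table in the join. A minimal sketch, assuming the tables are named a, b and c as in the example above (note that svv_table_info only lists tables that contain rows):

```sql
-- Table names a, b, c are taken from the example above; adjust to your schema.
-- For key-distributed tables the diststyle column reads like KEY(id).
select "schema", "table", diststyle, tbl_rows
from svv_table_info
where "table" in ('a', 'b', 'c');
```

If any of the tables reports EVEN or a different key column, the join on id cannot be co-located for that table.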

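If a table contains 3 million rows or fewer and is written to infrequently, it should be safe to use DISTSTYLE ALL. If you do use DISTSTYLE KEY, verify that distributing the table on that key does not cause row skew (see the following query):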

select "schema", "table", skew_rows from svv_table_info;

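"skew_rows" is the ratio of the number of rows in the slice with the most rows to the number of rows in the slice with the fewest. It should be close to 1.00.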
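
To see where a skew comes from, you can look at the row count per slice, and for a small, rarely written table you can switch to DISTSTYLE ALL in place. A rough sketch, assuming the table b from the example above and a superuser session (stv_tbl_perm is only visible to superusers):

```sql
-- Rows stored on each slice for table b (name taken from the example above).
-- The name column in stv_tbl_perm is padded, hence the trim().
select slice, sum(rows) as row_count
from stv_tbl_perm
where trim(name) = 'b'
group by slice
order by slice;

-- If b is small and rarely written, copy it to every node instead of
-- distributing it by key.
alter table b alter diststyle all;
```

With DISTSTYLE ALL, every node holds a full copy of the table, so the join needs no redistribution, at the cost of extra storage and slower writes.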

How should I set the distkey for a left join with conditions in Redshift?
