Speeding up a query with a Hash Join in PostgreSQL

Date: 2018-01-22 16:14:03

Tags: sql postgresql join optimization hash

I have the following query:

SELECT
   Sum(fact_individual_re.quality_hours) AS C0,
   dim_gender.name AS C1,
   dim_date.year AS C2
FROM
   fact_individual_re
   INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
   INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year, dim_gender.name
ORDER BY dim_date.year ASC, dim_gender.name ASC, Sum(fact_individual_re.quality_hours) ASC

When I EXPLAIN the query, the Hash Join nodes take most of the time. Is there a way to minimize the Hash Join time:

Sort  (cost=190370.50..190370.55 rows=20 width=18) (actual time=4005.152..4005.154 rows=20 loops=1)
   Sort Key: dim_date.year, dim_gender.name, (sum(fact_individual_re.quality_hours))
   Sort Method: quicksort  Memory: 26kB
   ->  Finalize GroupAggregate  (cost=190369.07..190370.07 rows=20 width=18) (actual time=4005.106..4005.135 rows=20 loops=1)
         Group Key: dim_date.year, dim_gender.name
         ->  Sort  (cost=190369.07..190369.27 rows=80 width=18) (actual time=4005.100..4005.103 rows=100 loops=1)
               Sort Key: dim_date.year, dim_gender.name
               Sort Method: quicksort  Memory: 32kB
               ->  Gather  (cost=190358.34..190366.54 rows=80 width=18) (actual time=4004.966..4005.020 rows=100 loops=1)
                     Workers Planned: 4
                     Workers Launched: 4
                     ->  Partial HashAggregate  (cost=189358.34..189358.54 rows=20 width=18) (actual time=3885.254..3885.259 rows=20 loops=5)
                           Group Key: dim_date.year, dim_gender.name
                           ->  Hash Join  (cost=125.17..170608.34 rows=2500000 width=14) (actual time=2.279..2865.808 rows=2000000 loops=5)
                                 Hash Cond: (fact_individual_re.dim_gender_id = dim_gender.id)
                                 ->  Hash Join  (cost=124.13..150138.54 rows=2500000 width=12) (actual time=2.060..2115.234 rows=2000000 loops=5)
                                       Hash Cond: (fact_individual_re.dim_date_id = dim_date.id)
                                       ->  Parallel Seq Scan on fact_individual_re  (cost=0.00..118458.00 rows=2500000 width=12) (actual time=0.204..982.810 rows=2000000 loops=5)
                                       ->  Hash  (cost=78.50..78.50 rows=3650 width=8) (actual time=1.824..1.824 rows=3650 loops=5)
                                             Buckets: 4096  Batches: 1  Memory Usage: 175kB
                                             ->  Seq Scan on dim_date  (cost=0.00..78.50 rows=3650 width=8) (actual time=0.143..1.030 rows=3650 loops=5)
                                 ->  Hash  (cost=1.02..1.02 rows=2 width=10) (actual time=0.193..0.193 rows=2 loops=5)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Seq Scan on dim_gender  (cost=0.00..1.02 rows=2 width=10) (actual time=0.181..0.182 rows=2 loops=5)
 Planning time: 0.609 ms
 Execution time: 4020.423 ms
(26 rows)

I am using PostgreSQL v10.
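For reference, a plan like the one above is produced by running the query under EXPLAIN (ANALYZE):

EXPLAIN (ANALYZE)
SELECT
   Sum(fact_individual_re.quality_hours) AS C0,
   dim_gender.name AS C1,
   dim_date.year AS C2
FROM
   fact_individual_re
   INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
   INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year, dim_gender.name
ORDER BY dim_date.year ASC, dim_gender.name ASC, Sum(fact_individual_re.quality_hours) ASC;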

2 Answers:

Answer 0 (score: 3)

I suggest partially grouping the rows before joining:

select
  sum(quality_hours_sum) AS C0,
  dim_gender.name AS C1,
  dim_date.year AS C2
from 
  (
    select
      sum(quality_hours) as quality_hours_sum,
      dim_date_id,
      dim_gender_id
    from fact_individual_re
    group by dim_date_id, dim_gender_id
  ) as fact_individual_re_sum
  join dim_date on dim_date_id = dim_date.id
  join dim_gender on dim_gender_id = dim_gender.id
group by dim_date.year, dim_gender.name
order by dim_date.year, dim_gender.name, sum(quality_hours_sum);

This way you will join only 1460 rows (count(distinct dim_date_id) * count(distinct dim_gender_id)) instead of all 2M rows. It still has to read and group all 2M rows, though; to avoid that you would need to maintain a summary table, for example with triggers.
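A minimal sketch of such a trigger-maintained summary table, assuming quality_hours is numeric (adjust the type to match the fact table); the names fact_re_summary and fact_re_summary_upd are hypothetical, and only INSERT on the fact table is handled here (UPDATE and DELETE would need analogous triggers):

CREATE TABLE fact_re_summary (
    dim_date_id       int     NOT NULL,
    dim_gender_id     int     NOT NULL,
    quality_hours_sum numeric NOT NULL,
    PRIMARY KEY (dim_date_id, dim_gender_id)
);

CREATE FUNCTION fact_re_summary_upd() RETURNS trigger AS $$
BEGIN
    -- Fold the new fact row into the running per-(date, gender) sum.
    INSERT INTO fact_re_summary AS s (dim_date_id, dim_gender_id, quality_hours_sum)
    VALUES (NEW.dim_date_id, NEW.dim_gender_id, NEW.quality_hours)
    ON CONFLICT (dim_date_id, dim_gender_id)
    DO UPDATE SET quality_hours_sum = s.quality_hours_sum + EXCLUDED.quality_hours_sum;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_re_summary_trg
AFTER INSERT ON fact_individual_re
FOR EACH ROW EXECUTE PROCEDURE fact_re_summary_upd();

The aggregate query can then read fact_re_summary (at most a few thousand rows) instead of scanning the 2M-row fact table.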

Answer 1 (score: 0)

No predicates are shown against the fact table, so we can assume 100% of the table is needed prior to filtering through the joins.

Indexes exist on the lookup tables but, as you say, no covering index. Given that 100% of the fact table is scanned, combined with the indexes not being covering, I would expect it to hash join.

As an experiment, you could apply a covering index (indexing dim_date.id and dim_date.year in a single index) to see whether it swaps out the hash join against dim_date.
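A sketch of that experiment; the index name is hypothetical, and since PostgreSQL 10 has no INCLUDE clause, a plain composite index plays the covering role:

CREATE INDEX dim_date_id_year_idx ON dim_date (id, year);

With both the join key and the selected column in the index, the planner can satisfy dim_date with an index-only scan and may consider a merge or nested-loop join instead of building a hash table.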

Overall though, lacking predicates, and covering indexes aside, a hash join is not necessarily the wrong query plan.