I have the following query:
SELECT
Sum(fact_individual_re.quality_hours) AS C0,
dim_gender.name AS C1,
dim_date.year AS C2
FROM
fact_individual_re
INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year, dim_gender.name
ORDER BY dim_date.year ASC, dim_gender.name ASC, Sum(fact_individual_re.quality_hours) ASC
When I EXPLAIN ANALYZE it, the hash joins take most of the time. Is there any way to minimize the hash join time?
Sort (cost=190370.50..190370.55 rows=20 width=18) (actual time=4005.152..4005.154 rows=20 loops=1)
Sort Key: dim_date.year, dim_gender.name, (sum(fact_individual_re.quality_hours))
Sort Method: quicksort Memory: 26kB
-> Finalize GroupAggregate (cost=190369.07..190370.07 rows=20 width=18) (actual time=4005.106..4005.135 rows=20 loops=1)
Group Key: dim_date.year, dim_gender.name
-> Sort (cost=190369.07..190369.27 rows=80 width=18) (actual time=4005.100..4005.103 rows=100 loops=1)
Sort Key: dim_date.year, dim_gender.name
Sort Method: quicksort Memory: 32kB
-> Gather (cost=190358.34..190366.54 rows=80 width=18) (actual time=4004.966..4005.020 rows=100 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial HashAggregate (cost=189358.34..189358.54 rows=20 width=18) (actual time=3885.254..3885.259 rows=20 loops=5)
Group Key: dim_date.year, dim_gender.name
-> Hash Join (cost=125.17..170608.34 rows=2500000 width=14) (actual time=2.279..2865.808 rows=2000000 loops=5)
Hash Cond: (fact_individual_re.dim_gender_id = dim_gender.id)
-> Hash Join (cost=124.13..150138.54 rows=2500000 width=12) (actual time=2.060..2115.234 rows=2000000 loops=5)
Hash Cond: (fact_individual_re.dim_date_id = dim_date.id)
-> Parallel Seq Scan on fact_individual_re (cost=0.00..118458.00 rows=2500000 width=12) (actual time=0.204..982.810 rows=2000000 loops=5)
-> Hash (cost=78.50..78.50 rows=3650 width=8) (actual time=1.824..1.824 rows=3650 loops=5)
Buckets: 4096 Batches: 1 Memory Usage: 175kB
-> Seq Scan on dim_date (cost=0.00..78.50 rows=3650 width=8) (actual time=0.143..1.030 rows=3650 loops=5)
-> Hash (cost=1.02..1.02 rows=2 width=10) (actual time=0.193..0.193 rows=2 loops=5)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dim_gender (cost=0.00..1.02 rows=2 width=10) (actual time=0.181..0.182 rows=2 loops=5)
Planning time: 0.609 ms
Execution time: 4020.423 ms
(26 rows)
I am using PostgreSQL v10.
Answer 0 (score: 3)
I suggest partially grouping the rows before the join:
select
sum(quality_hours_sum) AS C0,
dim_gender.name AS C1,
dim_date.year AS C2
from
(
select
sum(quality_hours) as quality_hours_sum,
dim_date_id,
dim_gender_id
from fact_individual_re
group by dim_date_id, dim_gender_id
) as fact_individual_re_sum
join dim_date on dim_date_id = dim_date.id
join dim_gender on dim_gender_id = dim_gender.id
group by dim_date.year, dim_gender.name
order by dim_date.year, dim_gender.name, 1;
This way you will join only 1460 rows (count(distinct dim_date_id) * count(distinct dim_gender_id)) instead of all 2M rows. It still has to read and group all 2M rows, though; to avoid that, you would need to maintain a summary table, for example with triggers.
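If refreshing on a schedule (rather than trigger-maintained, per-row freshness) is acceptable, a materialized view is a simpler sketch of the same idea; the view and index names below are illustrative, not from the original question:

```sql
-- Hedged sketch: pre-aggregate the 2M fact rows once, so queries only
-- touch the ~1460 pre-grouped rows. Names are hypothetical.
CREATE MATERIALIZED VIEW fact_individual_re_sum AS
SELECT dim_date_id,
       dim_gender_id,
       sum(quality_hours) AS quality_hours_sum
FROM fact_individual_re
GROUP BY dim_date_id, dim_gender_id;

-- Re-run after the fact table changes; the full scan cost is paid here,
-- not at query time.
REFRESH MATERIALIZED VIEW fact_individual_re_sum;
```

Queries then join dim_date and dim_gender to fact_individual_re_sum instead of the raw fact table. The trade-off is staleness between refreshes, which trigger-maintained summary tables avoid at the cost of write overhead.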
Answer 1 (score: 0)
There is no predicate shown on the fact table, so we can assume 100% of the table is needed before it is filtered by the joins.
Indexes exist on the lookup tables, but from what you have said they are not covering indexes. Given that 100% of the fact table is scanned, combined with the lack of a covering index, I would expect it to hash join.
As an experiment, you could add a covering index (indexing dim_date.id and dim_date.year in a single index) to see whether it replaces the hash join against dim_date.
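The experiment above might look like the following; the index name is hypothetical, and note that PostgreSQL 10 predates INCLUDE columns (added in v11), so a plain multicolumn index is used:

```sql
-- Hedged sketch: a multicolumn index on dim_date covering both the join
-- key (id) and the selected column (year), so the planner can satisfy
-- the dim_date side of the join from the index alone.
CREATE INDEX dim_date_id_year_idx ON dim_date (id, year);

ANALYZE dim_date;  -- refresh statistics before re-running EXPLAIN ANALYZE
```

With only 3650 rows in dim_date, the planner may still prefer a sequential scan plus hash join, which is not unreasonable at that size.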
Overall though, given the lack of predicates, and covering indexes aside, a hash join is not necessarily the wrong query plan here.