Guys, I have this query to populate data:
insert into cem_summary_hourly_dly_sum_fct_20180805_x
select
trx_dt_sk_id,
month_sk_id,
sbscrptn_ek_id,
msisdn,
hours,
string_agg((case when res.seqnum <= 5 then category_name end), '|') as category_name,
string_agg((case when res.seqnum <= 5 then application_name end), '|') as application_name,
string_agg((case when res.seqnum <= 5 then browser_name end), '|') as browser_name,
string_agg((case when res.seqnum <= 5 then os_name end), '|') as os_name,
string_agg((case when res.seqnum <= 5 then volume_total_mb::character varying end), '|') as volume_mb_split,
string_agg((case when res.seqnum <= 5 then activity_sec::character varying end), '|') as active_sec_split,
sum(coalesce(volume_in, 0)) as volume_in,
sum(coalesce(volume_out, 0)) as volume_out,
sum(coalesce(volume_total_mb, 0)) as volume_total_mb,
sum(coalesce(activity_sec, 0)) as activity_sec
from (
select
trx_dt_sk_id,
month_sk_id,
sbscrptn_ek_id,
msisdn,
hours,
category_name,
application_name,
browser_name,
os_name,
rank() over (partition by hours order by sum(volume_total_mb) desc) as seqnum,
sum(coalesce(volume_in, 0)) as volume_in,
sum(coalesce(volume_out, 0)) as volume_out,
sum(coalesce(volume_total_mb, 0)) as volume_total_mb,
sum(coalesce(activity_sec, 0)) as activity_sec
from dwh.cem_summary_hourly_dly_fct_1_prt_20180805 src
group by 1,2,3,4,5,6,7,8,9 ) res
group by 1,2,3,4,5;
So basically I group the rows in the subquery and rank them, then in the outer query I retrieve the rows whose rank is <= 5 (the top 5). This query works, but it takes a very long time (one day of data in this table is around 3 billion records); it can take more than an hour.
I tried another approach (doing the grouping into a temporary table first, then retrieving the results from it), but it made no difference.
Any suggestions to make this query run faster?
Here is the EXPLAIN output for the query:
"Gather Motion 64:1 (slice2; segments: 64) (cost=1902258144.94..1916188904.50 rows=30284260 width=492)"
" -> GroupAggregate (cost=1902258144.94..1916188904.50 rows=473192 width=492)"
" Group By: res.trx_dt_sk_id, res.month_sk_id, res.sbscrptn_ek_id, res.msisdn, res.hours"
" -> Sort (cost=1902258144.94..1903015251.44 rows=4731916 width=1180)"
" Sort Key: res.trx_dt_sk_id, res.month_sk_id, res.sbscrptn_ek_id, res.msisdn, res.hours"
" -> Subquery Scan res (cost=1279332227.27..1284631972.75 rows=4731916 width=1180)"
" -> Window (cost=1279332227.27..1281603546.76 rows=4731916 width=1204)"
" Partition By: src.hours"
" Order By: (sum(src.volume_total_mb))"
" -> Sort (cost=1279332227.27..1280089333.76 rows=4731916 width=1204)"
" Sort Key: src.hours, (sum(src.volume_total_mb))"
" -> Redistribute Motion 64:64 (slice1; segments: 64) (cost=391004281.95..650282943.07 rows=4731916 width=1204)"
" Hash Key: src.hours"
" -> HashAggregate (cost=391004281.95..641197665.10 rows=4731916 width=1204)"
" Group By: src.trx_dt_sk_id, src.month_sk_id, src.sbscrptn_ek_id, src.msisdn, src.hours, src.category_name, src.application_name, src.browser_name, src.os_name"
" -> Append-only Columnar Scan on cem_summary_hourly_dly_fct_1_prt_20180805 src (cost=0.00..41629947.84 rows=47319156 width=104)"
"Settings: optimizer=off"
"Optimizer status: legacy query optimizer"
Some information: the table size is about 60 GB. The current process takes more than an hour, and we are expected to get it down to around 15 minutes.
Answer 0 (score: -1)
I can't give you an exact answer, so I'll give you some hints.
First, build the query up step by step and check EXPLAIN ANALYZE and the timing after each step. When you see the time increase too much, investigate that step in more detail.
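As a minimal sketch of that workflow, you could start by timing just the grouped inner select on its own (this uses the source table from the question; note that EXPLAIN ANALYZE actually executes the query, so each run takes as long as the step itself):

```sql
-- Step 1: time the bare grouping, before adding rank() or the outer query.
explain analyze
select
    trx_dt_sk_id, month_sk_id, sbscrptn_ek_id, msisdn, hours,
    sum(coalesce(volume_total_mb, 0)) as volume_total_mb
from dwh.cem_summary_hourly_dly_fct_1_prt_20180805
group by 1, 2, 3, 4, 5;
```

Then repeat with the remaining grouping columns, the window function, and finally the outer aggregation, comparing actual times between runs.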
For the inner select, you can try a composite index containing all nine of those fields to help the grouping.
Also, you need an index on hours to help the ranking function.
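Those two suggestions could be sketched as follows (the index names are made up; also note the plan shows this is a Greenplum append-only columnar partition, so whether the planner will actually use these indexes for a full-table aggregation is something to verify with EXPLAIN):

```sql
-- Hypothetical composite index covering the nine GROUP BY columns:
create index ix_cem_dly_grp
    on dwh.cem_summary_hourly_dly_fct_1_prt_20180805
       (trx_dt_sk_id, month_sk_id, sbscrptn_ek_id, msisdn, hours,
        category_name, application_name, browser_name, os_name);

-- Hypothetical index on the window-function partition key:
create index ix_cem_dly_hours
    on dwh.cem_summary_hourly_dly_fct_1_prt_20180805 (hours);
```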
Retest each time: first create the select, then add a field, add the group by, add the aggregate functions, add rank(), and whenever you see the time increase, add an index.