I have been running the following query against a 56 GB table (789,700,760 rows) and hitting a performance bottleneck. From some earlier examples I gathered that there may be a way to "un-nest" the INNER JOIN so the query performs better on large data sets. In particular, the query below took 7.651 hours to finish on an MPP PostgreSQL deployment.
create table large_table as
select column1, column2, column3, column4, column5, column6
from
(
select
a.column1, a.column2, a.start_time,
rank() OVER(
PARTITION BY a.column2, a.column1 order by a.start_time DESC
) as rank,
last_value( a.column3) OVER (
PARTITION BY a.column2, a.column1 order by a.start_time ASC
RANGE BETWEEN unbounded preceding and unbounded following
) as column3,
a.column4, a.column5, a.column6
from
(table2 s
INNER JOIN table3 t
ON s.column2=t.column2 and s.event_time > t.start_time
) a
) b
where rank =1;
Question 1: Is there any way to modify the SQL above to speed up the overall execution time of the query?
Answer 0 (score: 1)
You can move the last_value into the outer query, which may give you some performance gain. The last_value picks the value of column3 for the row with the latest start_time in each partition - exactly the rows where rank = 1:
select column1, column2,
last_value(column3) OVER (PARTITION BY column2, column1 order by start_time ASC
RANGE BETWEEN unbounded preceding and unbounded following
) as column3,
column4, column5, column6
from (select a.column1, a.column2, a.start_time,
rank() OVER (PARTITION BY a.column2, a.column1 order by a.start_time DESC
) as rank,
a.column3, a.column4, a.column5, a.column6
from (table2 s INNER JOIN
table3 t
ON s.column2 = t.column2 and s.event_time > t.start_time
) a
) b
where rank = 1;
Otherwise, you would need to provide the execution plan and more information about table2 and table3 to get further help.
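As a sanity check, the two query shapes can be compared on a tiny synthetic data set. The sketch below is hypothetical: which columns live in which table is not stated in the question, so it assumes column2 is the join key in both tables, start_time comes from table3, and everything else comes from table2. It uses SQLite (whose window functions follow the same semantics as PostgreSQL's for RANK and LAST_VALUE with an unbounded frame) purely to show that filtering on rank first and computing last_value afterwards returns the same rows.

```python
# Hypothetical equivalence check for the original vs. rewritten query.
# Assumed schema (not confirmed by the question): column2 joins the tables,
# start_time is in table3, the remaining columns are in table2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (column1, column2, event_time, column3,
                         column4, column5, column6);
    CREATE TABLE table3 (column2, start_time);
    INSERT INTO table2 VALUES
        ('x', 'g', 25, 'a3_1', 'c4', 'c5', 'c6'),
        ('x', 'g', 15, 'a3_2', 'c4', 'c5', 'c6'),
        ('y', 'g', 25, 'b3',   'd4', 'd5', 'd6');
    INSERT INTO table3 VALUES ('g', 10), ('g', 20);
""")

# Shared join subquery (columns listed explicitly so column2 is unambiguous).
JOIN = """FROM (SELECT s.column1, s.column2, t.start_time, s.column3,
                       s.column4, s.column5, s.column6
                FROM table2 s JOIN table3 t
                  ON s.column2 = t.column2 AND s.event_time > t.start_time) a"""

# Original shape: last_value is computed over every joined row.
original = f"""
    SELECT column1, column2, column3, column4, column5, column6
    FROM (SELECT a.column1, a.column2, a.start_time,
                 RANK() OVER (PARTITION BY a.column2, a.column1
                              ORDER BY a.start_time DESC) AS rnk,
                 LAST_VALUE(a.column3) OVER (PARTITION BY a.column2, a.column1
                              ORDER BY a.start_time ASC
                              RANGE BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING) AS column3,
                 a.column4, a.column5, a.column6
          {JOIN}) b
    WHERE rnk = 1 ORDER BY column1"""

# Rewritten shape: filter on rank first, then take last_value of the survivors.
rewritten = f"""
    SELECT column1, column2,
           LAST_VALUE(column3) OVER (PARTITION BY column2, column1
                        ORDER BY start_time ASC
                        RANGE BETWEEN UNBOUNDED PRECEDING
                                  AND UNBOUNDED FOLLOWING) AS column3,
           column4, column5, column6
    FROM (SELECT a.column1, a.column2, a.start_time,
                 RANK() OVER (PARTITION BY a.column2, a.column1
                              ORDER BY a.start_time DESC) AS rnk,
                 a.column3, a.column4, a.column5, a.column6
          {JOIN}) b
    WHERE rnk = 1 ORDER BY column1"""

orig_rows = conn.execute(original).fetchall()
new_rows = conn.execute(rewritten).fetchall()
print(orig_rows == new_rows)  # the two shapes return the same rows
```

On real data the win comes from computing the second window function over only the rank-1 rows instead of over the full 789M-row join output; the actual gain still depends on the MPP engine's plan, so verify with EXPLAIN before relying on it.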