I have a question about Hive. Let me explain the scenario.
Here is the query:
INSERT OVERWRITE TABLE final_table
SELECT
T1.Id,
T1.some_field_name,
T1.another_filed_name,
T2.also_another_filed_name,
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table
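Given the structure of this query, is there a way to rewrite it so that I can avoid so many JOINs?
Thanks in advance.
PS: Even with vectorization enabled, the run time was the same.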
Answer 0: (score: 0)
This is too long for a comment; it will be deleted later.
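(1) Your current query does not compile.
(2) You do not select anything from T3 and T4, which makes no sense.
(3) Changing the order of the tables will not have any effect when the cost-based optimizer is in use.
(4) Basically, I would recommend collecting statistics on the tables, specifically on the id column, but in your case I have a feeling that id is not unique in more than one table.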
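For point (4), a minimal sketch of how statistics could be collected in Hive. It assumes the join key is the column id in each of the four tables and that the tables are not partitioned; the two SET properties are standard Hive settings, but whether they are already enabled on your cluster is an assumption.

```sql
-- Let the cost-based optimizer use stored statistics (may already be on).
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;

-- Table-level statistics (row counts, data sizes).
ANALYZE TABLE table1 COMPUTE STATISTICS;
ANALYZE TABLE table2 COMPUTE STATISTICS;
ANALYZE TABLE table3 COMPUTE STATISTICS;
ANALYZE TABLE table4 COMPUTE STATISTICS;

-- Column-level statistics for the join key.
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table3 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table4 COMPUTE STATISTICS FOR COLUMNS id;
```

Statistics help the optimizer pick a join strategy, but they will not help if the joins multiply rows because id is duplicated, which is what the query further down is meant to check.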
Please add the result of the following query to your post:
select *
, case when cnt_1 = 0 then 1 else cnt_1 end
* case when cnt_2 = 0 then 1 else cnt_2 end
* case when cnt_3 = 0 then 1 else cnt_3 end
* case when cnt_4 = 0 then 1 else cnt_4 end as product
from (select id
,count(case when tab = 1 then 1 end) as cnt_1
,count(case when tab = 2 then 1 end) as cnt_2
,count(case when tab = 3 then 1 end) as cnt_3
,count(case when tab = 4 then 1 end) as cnt_4
from ( select 1 as tab,id from table1
union all select 2 as tab,id from table2
union all select 3 as tab,id from table3
union all select 4 as tab,id from table4
) t
group by id
having greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
) t
order by product desc
limit 10
;
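If the diagnostic query returns no rows for table3 and table4 with counts above 1, that is, if id turns out to be unique in both tables, then the LEFT JOINs on T3 and T4 neither add nor remove rows, and since nothing is selected from them they could be dropped. The following is only a sketch under that assumption; if id is duplicated in either table, removing the join changes the number of rows written to final_table.

```sql
-- Sketch only: equivalent to the original query ONLY IF id is unique
-- in table3 and table4 (an assumption to verify with the counts above).
INSERT OVERWRITE TABLE final_table
SELECT
    T1.Id,
    T1.some_field_name,
    T1.another_filed_name,
    T2.also_another_filed_name   -- trailing comma removed so the query compiles
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id );  -- T2 is the smallest table
```

This addresses points (1) and (2): the trailing comma after the last selected column is gone, and the two joins that contributed no columns are removed.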