Question

Should I create and drop intermediate tables in Hive?

I can write something like this (simplified):

drop table if exists tmp1;
create table tmp1 as
select a, b, c
from input1
where a > 1 and b < 3;

drop table if exists tmp2;
create table tmp2 as
select x, y, z
from input2
where x < 6;

drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1 join tmp2 on tmp1.c = tmp2.z
group by tmp1.b;
drop table tmp1;
drop table tmp2;

Or I can roll everything into a single statement:

drop table if exists output;
create table output as
select x, a, count(*) as count
from (select a, b, c
    from input1
    where a > 1 and b < 3) t1
join (select x, y, z
    from input2
    where x < 6) t2
on t1.c = t2.z
group by t1.b;

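Obviously, if I reuse the intermediate tables more than once, it makes perfect sense to create them. However, when they are used only once, I have a choice.

I tried both. The second version is 6% faster as measured by wall-clock time, but 4% slower as measured by the MapReduce Total cumulative CPU time log output. This difference is probably within the random margin of error (caused by other processes, etc.). However, is it possible that combining the queries could result in a dramatic speedup?

Another question: are intermediate tables that are used only once a normal occurrence in Hive code, or should they be avoided whenever possible?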

Answer 1

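There is one significant difference. Running one big query gives the optimizer more freedom in its optimizations. One of the most important optimizations in cases like this is the parallelism controlled by hive.exec.parallel. When it is set to true, Hive executes independent stages in parallel. In your case, imagine that t1 and t2 in the second query do more complex work, such as a group by. In the second query, t1 and t2 will then execute simultaneously, whereas in the first script they will run serially.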
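
As a rough sketch of that point, assume the tables and columns from the question, with the plain filters swapped for hypothetical aggregating subqueries so that each side becomes its own stage. The column aliases `cnt_a` and `cnt_x` are made up for illustration, and the thread limit shown is only an example value:

```sql
-- Allow independent stages of a single query to run concurrently.
SET hive.exec.parallel=true;
-- Upper bound on how many stages may run at the same time (example value).
SET hive.exec.parallel.thread.number=8;

DROP TABLE IF EXISTS output;
CREATE TABLE output AS
SELECT t1.c, t1.cnt_a, t2.cnt_x
FROM (SELECT c, count(*) AS cnt_a      -- hypothetical aggregation on input1
      FROM input1
      WHERE a > 1 AND b < 3
      GROUP BY c) t1
JOIN (SELECT z, count(*) AS cnt_x      -- hypothetical aggregation on input2
      FROM input2
      WHERE x < 6
      GROUP BY z) t2
ON t1.c = t2.z;
```

With the multi-statement script, each create table statement has to finish before the next one starts, so the setting has nothing to overlap across statements.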

Answer 2

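I like to create multiple views and then create a table only at the very end. This lets the Hive optimizer reduce the number of map-reduce steps and execute stages in parallel, as dimamah and Nigel pointed out, while still keeping very complex pipelines readable.

For your example, you could replace it with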

create view if not exists tmp1_view as
select a, b, c
from input1
where a > 1 and b < 3;

create view if not exists tmp2_view as
select x, y, z
from input2
where x < 6;

drop table if exists output;
create table output as
select x, a, count(*) as count
from tmp1_view join tmp2_view on tmp1_view.c = tmp2_view.z
group by tmp1_view.b;
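
A related option, not mentioned in the answers, is a common table expression, which keeps the named steps but leaves no views behind in the metastore. This is only a sketch: it assumes a Hive version that supports WITH inside create table as select (0.13 or later), and it groups by x and a (my adjustment, with the alias `cnt` also made up) so that the select list matches the group by clause:

```sql
DROP TABLE IF EXISTS output;
CREATE TABLE output AS
WITH t1 AS (
    -- same filter as tmp1_view
    SELECT a, b, c FROM input1 WHERE a > 1 AND b < 3
),
t2 AS (
    -- same filter as tmp2_view
    SELECT x, y, z FROM input2 WHERE x < 6
)
SELECT t2.x, t1.a, count(*) AS cnt
FROM t1 JOIN t2 ON t1.c = t2.z
GROUP BY t2.x, t1.a;
```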

Answer 3

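I think combining the queries is a good thing. It allows the Hive query optimizer to optimize the query as a whole.

Consider this silly query: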

SELECT COUNT(*) FROM (SELECT * FROM clicks WHERE dt = '2014-01-07') t;

When you run it, Hive will launch just one MapReduce job.

Using an intermediate table instead:

CREATE TABLE tmp AS SELECT * FROM clicks WHERE dt = '2014-01-07';
SELECT COUNT(*) FROM tmp;

This will obviously run two MapReduce jobs.

So to answer your question: yes, combining queries can result in a speedup.
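
One way to check this on your own cluster, rather than taking the job counts on faith, is to ask Hive for the plan of each form and compare the stages it lists. A minimal sketch, reusing the `clicks` table from above:

```sql
-- Plan for the combined form: inspect the STAGE DEPENDENCIES section.
EXPLAIN
SELECT COUNT(*) FROM (SELECT * FROM clicks WHERE dt = '2014-01-07') t;

-- Plans for the two-step form, one statement at a time.
EXPLAIN
CREATE TABLE tmp AS SELECT * FROM clicks WHERE dt = '2014-01-07';

-- EXPLAIN does not create tmp, so tmp must already exist
-- for this last statement to compile.
EXPLAIN
SELECT COUNT(*) FROM tmp;
```

The exact stage layout depends on the Hive version and execution engine, so treat the output as a guide and not as a fixed rule.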

Answer 4

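As you have discovered, there is probably not much difference in timing. You would most likely want to keep the temp tables either (a) as "save points" for intermediate rollbacks or (b) for troubleshooting purposes. Otherwise, the administrative effort of remembering (or automating) the cleanup and deletion of intermediate tables may not be worth it.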
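
If the cleanup is the main concern, one possible way to automate it is Hive's session-scoped temporary tables. The sketch below assumes Hive 0.14 or later and reuses the table names from the question:

```sql
-- TEMPORARY tables are visible only to the current session and are
-- dropped automatically when the session ends, so no explicit drop is needed.
CREATE TEMPORARY TABLE tmp1 AS
SELECT a, b, c
FROM input1
WHERE a > 1 AND b < 3;

CREATE TEMPORARY TABLE tmp2 AS
SELECT x, y, z
FROM input2
WHERE x < 6;

-- Inspect an intermediate result while troubleshooting.
SELECT * FROM tmp1 LIMIT 10;
```

Because these tables vanish with the session, they suit troubleshooting within one run but not "save points" that need to survive a failed job.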

Hive SQL coding style: intermediate tables?
