Question

We are developing a set of Hive queries for our business users and noticed that this particular query takes a very long time:

SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
from 
(SELECT DISTINCT country, site, year, month FROM signals) t, 
(SELECT DISTINCT site, country, name from master_data) a 
WHERE t.site = a.site and t.country = a.country 
GROUP BY t.country, t.site, t.year, a.name;

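Each sub-select on its own takes about 25 seconds. The query without fetching the name takes 2 minutes, but as soon as the join comes in, the run time explodes.

Do you have any idea why the execution time increases so rapidly?

P.S. t returns 90 entries, a returns 263.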

Answer 1

I suggest you use EXPLAIN or EXPLAIN EXTENDED (for more details, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain) to find out how the query is actually executed. Studying the execution plan may reveal the cause.
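As a minimal sketch, prefixing the query from the question with EXPLAIN prints the execution plan instead of running the query. The table and column names are taken from the question; nothing else is assumed.

```sql
-- Print the plan (stages, join strategy, operators) without executing the query.
-- Use EXPLAIN EXTENDED instead for more detail.
EXPLAIN
SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
FROM
(SELECT DISTINCT country, site, year, month FROM signals) t,
(SELECT DISTINCT site, country, name FROM master_data) a
WHERE t.site = a.site AND t.country = a.country
GROUP BY t.country, t.site, t.year, a.name;
```

Comparing this plan with the plan of the variant without name should show where the extra stages or larger intermediate results come from.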

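By default, Hive uses MapReduce as its execution engine. You can try the Tez execution engine, which is often more efficient for complex queries. Add the following line before the SELECT statement.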

set hive.execution.engine=tez;

Look at your query. When name is not included, the right-hand sub-select in the join (named a) returns far fewer records, most likely 10.
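One way to check this, sketched with the table names from the question, is to count the distinct rows of the right-hand sub-select with and without name. The alias x is only an illustrative name.

```sql
-- Rows the right-hand side contributes when name is NOT selected.
SELECT COUNT(*) FROM (SELECT DISTINCT site, country FROM master_data) x;

-- Rows the right-hand side contributes when name IS selected.
SELECT COUNT(*) FROM (SELECT DISTINCT site, country, name FROM master_data) x;
```

If the second count is much larger, each (site, country) pair carries several names, so every matching row of t is repeated once per name in the join.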

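Try using a JOIN clause and specifying the join predicate in the join itself. The execution plan may be different and better optimized: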

SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
FROM 
(SELECT DISTINCT country, site, year, month FROM signals) t 
INNER JOIN (SELECT DISTINCT country, site, name FROM master_data) a 
      ON t.site = a.site AND t.country = a.country 
GROUP BY t.country, t.site, t.year, a.name;

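Also try a version without the sub-selects (note that without DISTINCT the count covers every joined row of signals, so the numbers can differ from the original query):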

SELECT t.country, t.site, t.year, a.name, COUNT(t.*)
FROM signals t 
INNER JOIN master_data a 
      ON t.site = a.site AND t.country = a.country 
GROUP BY t.country, t.site, t.year, a.name;

Hive query takes longer than expected
