My Hive query runs for a very long time

Time: 2017-03-14 19:27:43

Tags: hive

I am running the following query from the Hive CLI. The query runs for a long time and then fails.

SET hive.tez.container.size=10240; 
SET hive.tez.java.opts=-Xmx8192m; 
set tez.runtime.io.sort.mb=4096; 
set tez.runtime.unordered.output.buffer.size-mb=1024; 
set hive.exec.dynamic.partition=true; 
set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.vectorized.execution.reduce.enabled; 
set hive.execution.engine=tez;

SELECT 
cust_his.cname AS cname  
,cust_his.creg AS creg 
,Upper(Trim(cust_his.ccountry)) AS ccountry 
,Upper(Trim(cust_his.cloc)) AS cloc
FROM  
customer_history cust_his
WHERE  
cust_his.cust_d BETWEEN 20160501  AND 20160531
AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) <> ''
AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) IS NOT NULL
AND cast(Trim(cust_his.cmfid) as int) NOT IN ( 1,2,3 )
AND cust_his.cmat = '8';

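The table is partitioned on the cust_d column. The table holds 420 TB of data.

Please help me resolve this issue.

Thanks in advance.

1 answer:

Answer 0: (score: 0)

Your query should run on mappers only, because it has no GROUP BY, no JOIN, and no file merge.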

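Check how many mappers are launched and whether partition pruning works; look at the query plan. If partition pruning does not work, try replacing the BETWEEN condition with >= and <=, which sometimes helps. Vectorized execution may not support BETWEEN in your version, according to: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/query-vectorization.html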
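One way to do the plan check is sketched below, assuming the table and column names from the question. EXPLAIN DEPENDENCY lists the tables and partitions a query would read, so if pruning works, only the May 2016 partitions of customer_history should appear in its output. The single-column SELECT is a simplification for illustration; the full query can be substituted.

```sql
-- List the inputs the query would read, using >= and <= instead of BETWEEN.
-- If partition pruning works, only cust_d partitions for May 2016 are listed.
EXPLAIN DEPENDENCY
SELECT cust_his.cname
FROM customer_history cust_his
WHERE cust_his.cust_d >= 20160501
  AND cust_his.cust_d <= 20160531;
```

If the output lists every partition of the table, the filter on cust_d is not being used for pruning and the job would scan all of the data.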

Also add: set hive.vectorized.execution.enabled = true;
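As a side note on the settings in the question: in the Hive CLI, a set command with a property name and no value only prints the current value and changes nothing. A minimal sketch of the vectorization settings with explicit values, assuming reduce-side vectorization was also intended:

```sql
-- Enable vectorized execution explicitly.
set hive.vectorized.execution.enabled=true;
-- Assumption: the question meant to enable this rather than print its value.
set hive.vectorized.execution.reduce.enabled=true;
```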

This condition is also redundant:

AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) IS NOT NULL

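You do not need it, because you already have the same expression compared with <> '', and that comparison automatically excludes NULLs.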
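Putting these suggestions together, the query might look like the sketch below. It has not been run against the table in the question; it only applies the two changes described above (>= and <= in place of BETWEEN, and no IS NOT NULL check).

```sql
SELECT
    cust_his.cname AS cname,
    cust_his.creg AS creg,
    upper(trim(cust_his.ccountry)) AS ccountry,
    upper(trim(cust_his.cloc)) AS cloc
FROM customer_history cust_his
WHERE cust_his.cust_d >= 20160501
  AND cust_his.cust_d <= 20160531
  -- A comparison with NULL is never true, so <> '' already drops NULL values
  -- and a separate IS NOT NULL check is unnecessary.
  AND substr(trim(cust_his.cloc), 1, locate('|', cust_his.cloc, 1) - 1) <> ''
  AND cast(trim(cust_his.cmfid) AS int) NOT IN (1, 2, 3)
  AND cust_his.cmat = '8';
```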