Question

I have a table with about 300,000 records in each partition (ds).

When I run the following query in Hive (selecting with substr), it hangs at the step: map = 0%

select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
    select substr(stat_time,1,8) as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
    from db.table
    where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id

However, if I change the inner query from select substr(stat_time,1,8) as stat_date to select stat_time as stat_date, it runs normally.

stat_time has the format YYYYMMDDHHmm, e.g. 201702210900
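
For reference, here is a minimal illustration of what the inner query computes, using the sample value 201702210900. It assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later); on older versions the same expression would need to be run against a table.

```sql
-- substr is 1-based: take 8 characters starting at position 1,
-- which keeps the YYYYMMDD part and drops HHmm.
SELECT substr('201702210900', 1, 8) AS stat_date;  -- 20170221
```

Note that the two queries are therefore not equivalent: the substr version groups by day, while the stat_time version groups by minute.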

select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
    select stat_time as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
    from db.table
    where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id

So why does substr cause the performance degradation?

-

Edit:

I changed mapred.child.java.opts in mapred-site.xml and set HADOOP_HEAPSIZE to 4G in hadoop-env.sh. That worked.
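
The mapred.child.java.opts change was made cluster-wide in mapred-site.xml. As a rough sketch, and assuming the cluster does not mark the property as final, the same task JVM option can usually be overridden for a single Hive session instead. The 4096m value only mirrors the 4G figure used for HADOOP_HEAPSIZE; it is an assumption, not a tuned recommendation.

```sql
-- Assumption: session-level override of the child task JVM options.
-- -Xmx4096m is illustrative; pick a value that fits the cluster.
SET mapred.child.java.opts=-Xmx4096m;

-- Print the value currently in effect for this session.
SET mapred.child.java.opts;
```

HADOOP_HEAPSIZE is an environment variable read from hadoop-env.sh, so it cannot be changed this way and has to be set before Hive is started.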

My guess is that either storing or computing all the substr results uses a lot of heap.

If anyone knows why using the plain field value takes less memory than substr, please comment or answer.

Hive substr with group by causes performance degradation

0 answers: