I run the following HQL:
select new.uid as uid, new.category_id as category_id, new.atag as atag,
new.rank_idx + CASE when old.rank_idx is not NULL then old.rank_idx else 0 END as rank_idx
from (
select a1.uid, a1.category_id, a1.atag, row_number() over(distribute by a1.uid, a1.category_id sort by a1.cmt_time) as rank_idx from (
select app.uid,
CONCAT(cast(app.knowledge_point_id_list[0] as string),'#',cast(app.type_id as string)) as category_id,
app.atag as atag, app.cmt_time as cmt_time
from model.mdl_psr_app_behavior_question_result app
where app.subject = 'english'
and app.dt = '2016-01-14'
and app.cmt_timelen > 1000
and app.cmt_timelen < 120000
) a1
) new
left join (
select uid, category_id, rank_idx from model.mdl_psr_mlc_app_count_last
where subject = 'english'
and dt = '2016-01-13'
) old
on new.uid = old.uid
and new.category_id = old.category_id
Originally, mdl_psr_mlc_app_count_last and mdl_psr_mlc_app_count_day were stored with JsonSerde, and the query ran fine.
My colleagues think JsonSerde is inefficient and takes up too much space, and that Parquet would be a better choice for me.
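The question does not show the migration itself; as a minimal sketch of one way to do it (the _parquet table name, the column types, and the partitioning are all assumed from the query above):

CREATE TABLE model.mdl_psr_mlc_app_count_last_parquet (
  uid          BIGINT,
  category_id  STRING,
  rank_idx     INT
)
PARTITIONED BY (subject STRING, dt STRING)
STORED AS PARQUET;

-- reload one partition from the JsonSerde table into the Parquet copy
INSERT OVERWRITE TABLE model.mdl_psr_mlc_app_count_last_parquet
PARTITION (subject = 'english', dt = '2016-01-13')
SELECT uid, category_id, rank_idx
FROM model.mdl_psr_mlc_app_count_last
WHERE subject = 'english' AND dt = '2016-01-13';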
When I made the switch, the query broke with the following error log:
org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1 rows: used memory = 1024506232
2016-01-19 16:36:56,119 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10 rows: used memory = 1024506232
2016-01-19 16:36:56,130 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100 rows: used memory = 1024506232
2016-01-19 16:36:56,248 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1000 rows: used memory = 1035075896
2016-01-19 16:36:56,694 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10000 rows: used memory = 1045645560
2016-01-19 16:36:57,056 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100000 rows: used memory = 1065353232
It looks like a Java memory problem. Someone suggested I try:
-- mapred.child.java.opts is the deprecated MR1 name covering both map and reduce child JVMs
SET mapred.child.java.opts=-Xmx900m;
-- reducer container size in MB, and reducer JVM heap (here the heap equals the container size)
SET mapreduce.reduce.memory.mb=8048;
SET mapreduce.reduce.java.opts='-Xmx8048M';
-- mapper container size in MB, and mapper JVM heap (here the heap is larger than the 1024 MB container)
SET mapreduce.map.memory.mb=1024;
SET mapreduce.map.java.opts='-Xmx4096M';
SET mapred.child.map.java.opts='-Xmx4096M';
It still breaks with the same error message. Then someone else suggested:
SET mapred.child.java.opts=-Xmx900m;
-- container sizes and JVM heaps are now both 1024 MB
SET mapreduce.reduce.memory.mb=1024;
SET mapreduce.reduce.java.opts='-Xmx1024M';
SET mapreduce.map.memory.mb=1024;
SET mapreduce.map.java.opts='-Xmx1024M';
SET mapreduce.child.map.java.opts='-Xmx1024M';
-- spread the reduce work across 40 reducers
SET mapred.reduce.tasks = 40;
Now it runs without failure.
Can someone explain why?
================================ BTW: although it runs now, the reduce step is very slow. While you're at it, can you explain why that is, too?
Answer 0 (score: 0)
For some reason, YARN's support for Parquet is poor.
Quoting MapR:
For example, if a MapReduce job sorts Parquet files, the mapper needs to cache the whole Parquet row group in memory. I have done tests showing that the larger the row-group size of the Parquet files, the more mapper memory is needed. In that case, make sure the mapper memory is large enough not to trigger an OOM.
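If you stay on Parquet, the row-group size the quote refers to can be reduced at write time, so each mapper has less to cache. A minimal sketch, assuming the parquet-mr job property parquet.block.size (row-group size in bytes) is honored by your Hive version:

-- write 32 MB row groups instead of the ~128 MB parquet-mr default;
-- only affects data written after this SET, so existing partitions must be rewritten
SET parquet.block.size=33554432;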
I'm not sure why the different settings in the question matter, but the simple solution is to drop Parquet and use ORC. You trade a small performance hit for a bug-free run.
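The swap itself is just the storage clause; a sketch, with the _orc table name and column types assumed as in the question:

CREATE TABLE model.mdl_psr_mlc_app_count_last_orc (
  uid          BIGINT,
  category_id  STRING,
  rank_idx     INT
)
PARTITIONED BY (subject STRING, dt STRING)
STORED AS ORC;  -- ORC instead of PARQUET; reload the partitions the same way as before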