Question

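My table (MyTable, ~365 GB) contains 2 years of customer behavior data. It is partitioned by day and clustered by customer_id into 64 buckets. On an average day it holds about 8 million entries.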

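My task is to retrieve the customers for a given day (~512 MB) and look back at their behavior, e.g. the number of purchases over the last 2 years.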

As I understand it, a left semi join would be applicable here, for example:

    WITH TabA AS (
        SELECT cid, NUM_PURCHASES FROM MyTable WHERE dt >= '20161001' AND dt <= '20181001'
    ),
    TabB AS (
        SELECT cid FROM MyTable WHERE dt = '20181001'
    )
    SELECT TabA.cid AS ID,
           SUM(TabA.NUM_PURCHASES) AS total_p
    FROM TabA LEFT SEMI JOIN TabB ON (TabB.cid = TabA.cid)
    GROUP BY TabA.cid;

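When bucketing my table, I relied heavily on the join optimization advice published in "Hive join optimization". Accordingly, I set the following parameters in Hive (note that Tez does not work in my environment):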

    set hive.auto.convert.join=true;
    set hive.variable.substitute.depth=150;
    set hive.optimize.skewjoin.compiletime=true;
    set hive.optimize.skewjoin=true;
    set hive.enforce.bucketing=true;
    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    set hive.exec.parallel=true;
    set hive.vectorized.execution.enabled=true;
    set hive.vectorized.execution.reduce.enabled=true;
    set hive.vectorized.execution.reduce.groupby.enabled=true;
    set hive.cbo.enable=true;
    set mapred.child.java.opts=-Xmx4G -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit;
    set mapreduce.map.memory.mb=9216;
    set mapreduce.reduce.memory.mb=9216;

The last three lines were added because of memory problems.
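One detail in these memory settings may be worth checking: the container size (`mapreduce.*.memory.mb`) is 9216 MB, while the JVM heap is capped at 4 GB by `-Xmx4G`, so changing the container size alone leaves the heap limit unchanged. The fragment below is only a sketch of setting the heap per task type. It assumes a YARN/MR2 cluster where `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` are honored, and the 80% heap-to-container ratio is a common rule of thumb, not a tuned value for this workload.

```sql
-- Illustrative values only (assumption: heap at roughly 80% of the 9216 MB container).
set mapreduce.map.memory.mb=9216;
set mapreduce.map.java.opts=-Xmx7372m;
set mapreduce.reduce.memory.mb=9216;
set mapreduce.reduce.java.opts=-Xmx7372m;
```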

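My query fails on the first job. The mappers run to 100%, and as soon as the reducers (seem to) start, the job resets and then fails again. The cluster manager reports a Java heap space error. I also tried reducing the memory per mapper (6 GB, 4 GB) and per reducer (8 GB, 7 GB, 6 GB), in all combinations, but I got the same error.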

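Can someone give me some insight into a) how to make this work, b) how much memory I should allocate to each mapper/reducer, and c) whether my query can be optimized (e.g. by grouping by cid before the left semi join)?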
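To make option c) concrete, here is a sketch of what grouping by cid before the left semi join could look like. It reuses the table and column names from the query above and should return the same rows, because a semi join never duplicates rows on the left side. The `DISTINCT` in TabB is optional and is added here as an assumption that it shrinks the right-hand side; whether this variant actually runs faster or uses less memory has not been verified.

```sql
-- Sketch only: aggregate per customer first, then filter with the semi join.
WITH TabA AS (
    SELECT cid, SUM(NUM_PURCHASES) AS total_p
    FROM MyTable
    WHERE dt >= '20161001' AND dt <= '20181001'
    GROUP BY cid
),
TabB AS (
    -- DISTINCT is optional for a semi join; it only reduces the right-hand side.
    SELECT DISTINCT cid
    FROM MyTable
    WHERE dt = '20181001'
)
SELECT TabA.cid AS ID,
       TabA.total_p
FROM TabA LEFT SEMI JOIN TabB ON (TabB.cid = TabA.cid);
```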

Hive join optimization and resource allocation

0 answers: