Question

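I'm a novice, and I'm trying to take a large (1.25 TB uncompressed) HDFS file and put it into a Hive managed table. It is already on HDFS in CSV format (from sqoop) with an arbitrary partitioning, and I'm putting it into a more organized format for querying and joining. I'm on HDP 3.0 using Tez. Here is my hql: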

USE MYDB;

DROP TABLE IF EXISTS new_table;

CREATE TABLE IF NOT EXISTS new_table (
 svcpt_id VARCHAR(20),
 usage_value FLOAT,
 read_time SMALLINT)
PARTITIONED BY (read_date INT)
CLUSTERED BY (svcpt_id) INTO 9600 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy");

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.enforce.bucketing = true;
SET mapred.reduce.tasks = 10000;

INSERT OVERWRITE TABLE new_table
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM raw_table;

This is how Tez sets the job up (taken from my most recent failure):

--------------------------------------------------------------------------------
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1      SUCCEEDED   1043       1043        0        0       0       0
Reducer 2    RUNNING   9600        735       19     8846       0       0
Reducer 3     INITED  10000          0        0    10000       0       0
--------------------------------------------------------------------------------
VERTICES: 01/03  [==>>------------------------] 8%    ELAPSED TIME: 45152.59 s
--------------------------------------------------------------------------------
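The progress table shows two reducer vertices, one with 9600 tasks (the same number as the buckets) and one with 10000 tasks (the same number as mapred.reduce.tasks). A minimal sketch of how the plan behind those vertices could be inspected without launching the job, reusing the statement from the script above; what each vertex actually does is for the EXPLAIN output to show and is not assumed here:

```sql
USE MYDB;

-- Dynamic partitioning must be allowed for the statement to compile.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- EXPLAIN only prints the plan (vertices, edges, operators);
-- it does not read raw_table or write to new_table.
EXPLAIN
INSERT OVERWRITE TABLE new_table
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM raw_table;
```

The vertex names in the plan (Map 1, Reducer 2, Reducer 3) should line up with the rows of the progress table.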

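I have been working on this for a while. At first I could not get the first map 1 vertex to run, so I added buckets. With 96 buckets the first mapper ran, but reducer 2 failed, citing disk space issues that did not make sense. I then raised the number of buckets to 9600 and the reduce tasks to 10000, and the reducer 2 vertex started running, albeit slowly. This morning I found that the job had errored out because my namenode had shut down with a Java heap space error from the garbage collector.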
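Because the namenode keeps metadata for every file and block in its heap, the number of files the insert may create is one figure worth checking. A rough sketch, assuming every read_date partition can end up with one file per bucket (an upper bound, and only a guess at the cause of the heap error, not a diagnosis):

```sql
USE MYDB;

-- 9600 is the bucket count from the CREATE TABLE statement above.
-- Each partition can hold up to one file per bucket, so the product
-- is a rough ceiling on the files written by a single insert.
SELECT COUNT(DISTINCT read_date)        AS partition_count,
       COUNT(DISTINCT read_date) * 9600 AS max_bucket_files
FROM raw_table;
```

Note that this query scans the whole CSV table, so it is not cheap on 1.25 TB of input.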

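Does anyone have any guidance for me? I feel like I'm shooting in the dark with the number of reduce tasks, the number of buckets, and all of the configs shown below.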

hive.tez.container.size = 5120MB
hive.exec.reducers.bytes.per.reducer = 1GB
hive.exec.max.dynamic.partitions = 5000
hive.optimize.sort.dynamic.partition = FALSE
hive.vectorized.execution.enabled = TRUE
hive.vectorized.execution.reduce.enabled = TRUE
yarn.scheduler.minimum-allocation-mb = 2G
yarn.scheduler.maximum-allocation-mb = 8G
mapred.min.split.size=?
mapred.max.split.size=?
hive.input.format=?

LLAP has not been set up.

My cluster has 4 nodes, 32 cores, and 120 GB of memory. I have not used more than 1/3 of the cluster's storage.
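One untested direction, sketched under stated assumptions: drop the bucketing and turn on hive.optimize.sort.dynamic.partition (listed as FALSE above), which sorts rows by the partition key before the write so that each reducer keeps only one ORC writer open at a time. The table name new_table_sorted is hypothetical, and the reducer count is left for Hive to derive from hive.exec.reducers.bytes.per.reducer instead of being forced through mapred.reduce.tasks:

```sql
USE MYDB;

-- Hypothetical, non-bucketed variant of new_table.
-- ROW FORMAT DELIMITED is left out because ORC does not use text delimiters.
CREATE TABLE IF NOT EXISTS new_table_sorted (
  svcpt_id VARCHAR(20),
  usage_value FLOAT,
  read_time SMALLINT)
PARTITIONED BY (read_date INT)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=10000;
-- Sort by read_date ahead of the file sink so each reducer writes
-- one partition at a time instead of holding many writers open.
SET hive.optimize.sort.dynamic.partition=true;

INSERT OVERWRITE TABLE new_table_sorted
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM raw_table;
```

Whether this fits depends on how many distinct read_date values exist and on whether joins on svcpt_id really need bucketing; neither is known from the details given here.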

Configuring a large Hive import job

0 answers: