Large MapReduce job keeps dying

时间:2018-02-17 16:49:15

标签: hadoop hbase

I'm trying to run a MapReduce job over a ~10 TB HBase table using a mapper that subclasses TableMapper. The job essentially rewrites the entire table. The output is configured like this:

    FileOutputFormat.setOutputPath(job, tablePath);

    TableMapReduceUtil.initTableMapperJob(
            inputTableName,
            tblScanner,
            ResaltMapper.class,
            ImmutableBytesWritable.class, //outputKeyClass,
            KeyValue.class, // outputValueClass,
            job);

    HFileOutputFormat.configureIncrementalLoad(job, hTable);
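
For context, tblScanner and ResaltMapper are referenced above but not shown; tblScanner is just the Scan handed to initTableMapperJob. Below is a minimal sketch of what the mapper might look like, assuming it re-salts each row key and re-emits the row's cells as KeyValues (the resalt helper and the cell-copying details are assumptions, and the exact cell API varies by HBase version):

    import java.io.IOException;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;

    // Hypothetical sketch of the ResaltMapper referenced above.
    public class ResaltMapper extends TableMapper<ImmutableBytesWritable, KeyValue> {

        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            // Assumed helper that rewrites the salt prefix of the row key.
            byte[] newRow = resalt(key.get());
            ImmutableBytesWritable outKey = new ImmutableBytesWritable(newRow);

            // Copy every cell of the row onto the new row key so that
            // HFileOutputFormat can write it back out as HFiles.
            for (Cell cell : value.rawCells()) {
                context.write(outKey, new KeyValue(
                        newRow,
                        CellUtil.cloneFamily(cell),
                        CellUtil.cloneQualifier(cell),
                        cell.getTimestamp(),
                        CellUtil.cloneValue(cell)));
            }
        }

        private byte[] resalt(byte[] row) {
            // Placeholder: the real salting scheme depends on the table's key design.
            return row;
        }
    }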

I've attempted the job several times, and each time it dies after a few hours. I see the following messages in the application logs:

    {"timeStamp":"18/02/17 14:48:26,375","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000063_0 "}
    {"timeStamp":"18/02/17 14:48:26,376","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000101_0 "}
    {"timeStamp":"18/02/17 14:48:26,377","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000099_0 "}
    {"timeStamp":"18/02/17 14:48:26,377","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000112_0 "}
    {"timeStamp":"18/02/17 14:48:26,381","level":"WARN","category":"hdfs.DFSClient","message":"Slow ReadProcessor read fields took 152920ms (threshold=30000ms); ack: seqno: 1 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 20402922, targets: [DatanodeInfoWithStorage[10.40.177.236:50010,DS-4d0bd79b-eaf3-4ec0-93f1-203b74bdf87b,DISK], DatanodeInfoWithStorage[10.40.176.118:50010,DS-8506c9ff-206d-48c5-b476-04b8dc396a1c,DISK], DatanodeInfoWithStorage[10.40.186.216:50010,DS-36dece52-50c7-47b0-a202-2ee595fabbcc,DISK]] "}
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

I also see this message in the application report:

    NodeHealthReport    1/1 local-dirs are bad: /mnt/yarn/local; 1/1 log-dirs are bad: /mnt/yarn/logs

I'm not sure whether these messages are related to the failure. There is plenty of capacity on the cluster: it has 4 d2.8xlarge instances (96 × 2 TB HDDs across the 4 machines). However, individual drives are filling up. For example, in the current run one drive has only about 9 GB free, even though the other drives are only about half full:

    $ df -h
    Filesystem                    Size  Used Avail Use% Mounted on
    /dev/xvda1                     99G  5.0G   90G   6% /
    none                          4.0K     0  4.0K   0% /sys/fs/cgroup
    udev                          121G   12K  121G   1% /dev
    tmpfs                          25G  672K   25G   1% /run
    none                          5.0M     0  5.0M   0% /run/lock
    none                          121G   32K  121G   1% /run/shm
    none                          100M     0  100M   0% /run/user
    /dev/mapper/ephemeral_luks0   1.8T  1.7T  9.0G 100% /mnt
    /dev/mapper/ephemeral_luks1   1.8T  974G  767G  56% /mnt1
    /dev/mapper/ephemeral_luks2   1.8T  982G  760G  57% /mnt2
    /dev/mapper/ephemeral_luks3   1.8T  997G  745G  58% /mnt3
    /dev/mapper/ephemeral_luks4   1.8T  982G  760G  57% /mnt4
    ...snip...

Does anyone know what is causing this, and how I can fix it?

1 Answer:

Answer 0 (score: 1)

I figured it out: yarn.nodemanager.local-dirs was set to only a single HDD on each node in the cluster. Listing every HDD for every node fixed the problem.
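
In practice that means listing a directory on every data disk in yarn-site.xml on each NodeManager host. A sketch of the relevant properties, where the exact subdirectory names are assumptions based on the mount points in the df output above (one entry per ephemeral disk; only the first few are shown):

    <!-- yarn-site.xml on every NodeManager host -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt/yarn/local,/mnt1/yarn/local,/mnt2/yarn/local,/mnt3/yarn/local,/mnt4/yarn/local</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>/mnt/yarn/logs,/mnt1/yarn/logs,/mnt2/yarn/logs,/mnt3/yarn/logs,/mnt4/yarn/logs</value>
    </property>

With all the disks listed, YARN spreads container scratch space (including this job's map-side spills) across the drives instead of filling a single /mnt volume, which matches the 100%-full drive and the "local-dirs are bad" health report above.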