Question

我在大约一千个avro记录的数据集上有一个Yarn MR（有两个ec2实例来mapreduce）作业，并且地图阶段表现不正常。请参阅以下进度。当然，我检查了resourcemanager和nodemanagers上的日志，并没有看到任何可疑，但这些日志太冗长

那里发生了什么？

        hive> select * from nikon where qs_cs_s_aid='VIEW' limit 10;

        Total MapReduce jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_1352281315350_0020, Tracking URL = http://blabla.ec2.internal:8088/proxy/application_1352281315350_0020/
        Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=blabla.com:8032 -kill job_1352281315350_0020
        Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0

        2012-11-07 11:14:40,976 Stage-1 map = 0%,  reduce = 0%
        2012-11-07 11:15:06,136 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 10.38 sec
        2012-11-07 11:15:07,253 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 12.18 sec
        2012-11-07 11:15:08,371 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 12.18 sec
        2012-11-07 11:15:09,491 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 12.18 sec
        2012-11-07 11:15:10,643 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 15.42 sec
        (...)
        2012-11-07 11:15:35,441 Stage-1 map = 28%,  reduce = 0%, Cumulative CPU 37.77 sec
        2012-11-07 11:15:36,486 Stage-1 map = 28%,  reduce = 0%, Cumulative CPU 37.77 sec

here restart at 16% ?

        2012-11-07 11:15:37,692 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 21.15 sec
        2012-11-07 11:15:38,815 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 21.15 sec
        2012-11-07 11:15:39,865 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 21.15 sec
        2012-11-07 11:15:41,064 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 22.4 sec
        2012-11-07 11:15:42,181 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 22.4 sec
        2012-11-07 11:15:43,299 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 22.4 sec

here restart at 0% ?

        2012-11-07 11:15:44,418 Stage-1 map = 0%,  reduce = 0%
        2012-11-07 11:16:02,076 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 6.86 sec
        2012-11-07 11:16:03,193 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 6.86 sec
        2012-11-07 11:16:04,259 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 8.45 sec
        (...)
        2012-11-07 11:16:31,291 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 35.34 sec
        2012-11-07 11:16:32,414 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 37.93 sec

here restart at 11% ?

        2012-11-07 11:16:33,459 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 19.53 sec
        2012-11-07 11:16:34,507 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 19.53 sec
        2012-11-07 11:16:35,731 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 21.47 sec
        (...)
        2012-11-07 11:16:46,839 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 24.14 sec

here restart at 0% ?

        2012-11-07 11:16:47,939 Stage-1 map = 0%,  reduce = 0%
        2012-11-07 11:16:56,653 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 7.54 sec
        2012-11-07 11:16:57,814 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 7.54 sec
        (...)

毋庸置疑，一段时间后作业崩溃并出现错误：java.io.IOException：java.io.IOException：java.lang.ArrayIndexOutOfBoundsException：-56

Answer 1

这看起来就像hadoop在失败时重试地图任务（默认情况下，它会重试3次，每次都在不同的主机上），这就是它使你的工作更容错的方式。

如果失败是由特定主机上的临时问题引起的（这种情况比您想象的要多），这将非常有用。但是在你的情况下，你真的有一个数组越界异常由你的hive查询中的东西引起。我会检查失败的任务日志以尝试调试原因。

Hive NR地图进度不一致，并从0％重新启动

1 个答案: