Too many fetch-failures: Hadoop on cluster (x2)

Time: 2011-05-20 14:24:01

Tags: hadoop

I have been playing with Hadoop for the past week or so (trying to get to grips with it), and although I have been able to set up a multi-node cluster (2 machines: 1 laptop and a small desktop) and retrieve results, I always seem to run into "Too many fetch-failures" whenever I run a hadoop job.

An example output (on a simple wordcount example) is:

hadoop@ap200:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount sita sita-output3X
11/05/20 15:02:05 INFO input.FileInputFormat: Total input paths to process : 7
11/05/20 15:02:05 INFO mapred.JobClient: Running job: job_201105201500_0001
11/05/20 15:02:06 INFO mapred.JobClient:  map 0% reduce 0%
11/05/20 15:02:23 INFO mapred.JobClient:  map 28% reduce 0%
11/05/20 15:02:26 INFO mapred.JobClient:  map 42% reduce 0%
11/05/20 15:02:29 INFO mapred.JobClient:  map 57% reduce 0%
11/05/20 15:02:32 INFO mapred.JobClient:  map 100% reduce 0%
11/05/20 15:02:41 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:02:49 INFO mapred.JobClient: Task Id :      attempt_201105201500_0001_m_000003_0, Status : FAILED
Too many fetch-failures
11/05/20 15:02:53 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:02:57 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:10 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:14 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:03:17 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:25 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000006_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:29 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:03:32 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:35 INFO mapred.JobClient:  map 100% reduce 28%
11/05/20 15:03:41 INFO mapred.JobClient:  map 100% reduce 100%
11/05/20 15:03:46 INFO mapred.JobClient: Job complete: job_201105201500_0001
11/05/20 15:03:46 INFO mapred.JobClient: Counters: 25
11/05/20 15:03:46 INFO mapred.JobClient:   Job Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Launched reduce tasks=1
11/05/20 15:03:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=72909
11/05/20 15:03:46 INFO mapred.JobClient:     Total time spent by all reduces waiting  after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient:     Launched map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient:     Data-local map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=76116
11/05/20 15:03:46 INFO mapred.JobClient:   File Output Format Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Bytes Written=1412473
11/05/20 15:03:46 INFO mapred.JobClient:   FileSystemCounters
11/05/20 15:03:46 INFO mapred.JobClient:     FILE_BYTES_READ=4462381
11/05/20 15:03:46 INFO mapred.JobClient:     HDFS_BYTES_READ=6950740
11/05/20 15:03:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=7546513
11/05/20 15:03:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1412473
11/05/20 15:03:46 INFO mapred.JobClient:   File Input Format Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Bytes Read=6949956
11/05/20 15:03:46 INFO mapred.JobClient:   Map-Reduce Framework
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce input groups=128510
11/05/20 15:03:46 INFO mapred.JobClient:     Map output materialized bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient:     Combine output records=201001
11/05/20 15:03:46 INFO mapred.JobClient:     Map input records=137146
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce shuffle bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce output records=128510
11/05/20 15:03:46 INFO mapred.JobClient:     Spilled Records=507835
11/05/20 15:03:46 INFO mapred.JobClient:     Map output bytes=11435785
11/05/20 15:03:46 INFO mapred.JobClient:     Combine input records=1174986
11/05/20 15:03:46 INFO mapred.JobClient:     Map output records=1174986
11/05/20 15:03:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=784
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce input records=201001

I googled the problem, and the folks at Apache seem to suggest it could be anything from a network problem (or something to do with the /etc/hosts files) to a corrupt disk on one of the slave nodes.
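
As a quick sanity check of the network/hosts theory, one can verify that each machine resolves and reaches the other by hostname (ap200 is the master hostname visible in the output above; ap201 below is only a placeholder for the second node):

    # run on both machines; the hostnames must resolve to real LAN addresses, not 127.0.x.x
    getent hosts ap200 ap201
    ping -c 1 ap200
    ping -c 1 ap201
    # reducers fetch map output over HTTP from the tasktrackers (port 50060 by default)
    telnet ap200 50060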

Just to add: I do see 2 "live nodes" on the namenode admin panel (localhost:50070/dfshealth), and under the Map/Reduce admin I also see 2 nodes.

Any clues on how to avoid these errors? Thanks in advance.

Edit 1:

The tasktracker log is at: http://pastebin.com/XMkNBJTh and the datanode log is at: http://pastebin.com/ttjR7AYZ

Many thanks.

3 Answers:

Answer 0 (score: 2):

Modify the /etc/hosts file on the datanode.

Each line has three parts: the first part is the network IP address, the second part is the hostname or domain name, and the third part is the host alias. The detailed steps are as follows (an illustrative two-node hosts file is shown after the list):

  1. First check the hostname:

    cat /proc/sys/kernel/hostname

    You will see the HOSTNAME attribute; change the value after it if it is wrong, then exit.

  2. Use the command:

    hostname ***.***.***.***

    where the asterisks are replaced by the corresponding IP address.

  3. Similarly, modify the hosts configuration, for example:

    127.0.0.1      localhost.localdomain localhost
    ::1            localhost6.localdomain6 localhost6
    10.200.187.77  10.200.187.77 hadoop-datanode

  4. If the IP address has been configured and modified successfully, or the hostname still shows a problem, continue to modify the hosts file.
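
As an illustration of what these steps produce, a hosts file for a two-node cluster might look like the sketch below. The 192.168.0.x addresses are only placeholders, ap200 is the master hostname visible in the job output above, and ap201 stands in for the slave. The key point is that every node maps its own hostname to its real LAN address rather than to 127.0.0.1 or 127.0.1.1; otherwise reducers on the other machine cannot fetch map output and the job reports fetch-failures.

    # /etc/hosts on every node of the cluster (placeholder addresses)
    127.0.0.1      localhost
    192.168.0.1    ap200    # master: namenode + jobtracker
    192.168.0.2    ap201    # slave: datanode + tasktracker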

Answer 1 (score: 1):

The following solution will definitely work:

1. Delete or comment out the lines with the IPs 127.0.0.1 and 127.0.1.1

2. Use the hostname, not an alias, to refer to the node in the hosts file and in the master/slave files in the hadoop directory:

  -->in Host file 172.21.3.67 master-ubuntu

  -->in master/slave file master-ubuntu

3. Check that the NamespaceId of the namenode = NamespaceId of the datanode (one way to compare them is sketched below).
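
A quick way to compare the two IDs, assuming placeholder paths under the common hadoop.tmp.dir=/app/hadoop/tmp layout (the actual locations depend on your dfs.name.dir and dfs.data.dir settings):

    # on the namenode
    grep namespaceID /app/hadoop/tmp/dfs/name/current/VERSION
    # on each datanode
    grep namespaceID /app/hadoop/tmp/dfs/data/current/VERSION
    # the two values must be identical, otherwise the datanode refuses to join the cluster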

Answer 2 (score: 0):

I ran into the same problem: "Too many fetch-failures" and very slow Hadoop performance (the simple wordcount example took more than 20 minutes to run on a 2-node cluster of powerful servers). I also got "WARN mapred.JobClient: Error Reading task output ... Connection refused" errors.

The problem was fixed when I followed Thomas Jungblut's instructions: I removed my master node from the slaves configuration file. After that, the errors disappeared and the wordcount example took only 1 minute.
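
For reference, the list of worker nodes in Hadoop 0.20 lives in conf/slaves, so the fix described above amounts to deleting the master's hostname from that file (ap200 and ap201 are again placeholders for the master and slave hostnames):

    # conf/slaves before the change
    ap200
    ap201

    # conf/slaves after the change
    ap201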