Reducer stuck because of a dead host

Time: 2012-08-08 18:57:53

Tags: hadoop

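I noticed that my reducer is stuck because of a dead host. The log shows a lot of retry messages. Is it possible to tell the jobtracker to give up on the dead node and resume work? There are 323 mappers and only 1 reducer. I am on hadoop-1.0.3.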

2012-08-08 11:52:19,903 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 65 seconds.
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 5 seconds.
2012-08-08 11:53:29,906 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 copy failed: attempt_201207191440_0203_m_000001_0 from 192.168.1.23
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201207191440_0203_r_000000_0: Failed fetch #18 from attempt_201207191440_0203_m_000001_0
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 adding host 192.168.1.23 to penalty box, next contact in 1124 seconds
2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0: Got 1 map-outputs from previous failures
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 1089 seconds.

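I left it alone. It retried for a while, then gave up on the dead host, re-ran the mapper, and succeeded. The problem was caused by the host having two IP addresses: I deliberately shut one of them down, and it happened to be the one Hadoop was using.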

My question is whether there is a way to tell Hadoop to give up on a dead host without retrying.

1 answer:

Answer 0 (score: 3)

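From the log you can see that one of the tasktrackers that ran a map task cannot be reached. The tasktracker running the reducer tries to retrieve the intermediate map results over HTTP, and it fails because the tasktracker holding those results is dead.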

The default behavior when a tasktracker fails is as follows:

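The jobtracker arranges for map tasks that ran and completed successfully on the failed tasktracker to be rerun if they belong to incomplete jobs, because their intermediate output resides on the failed tasktracker's local filesystem and may not be accessible to the reduce task. Any tasks in progress are also rescheduled.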

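The problem is that if a task (map or reduce) fails too many times (4 times, I think), it is not rescheduled again and the job fails. In your case the maps appear to have completed successfully, but the reducer cannot connect to the mapper's host to retrieve the intermediate results. It tried 4 times and then failed.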

A failed task cannot simply be ignored, because it is part of the job: the job itself succeeds only if all of its tasks succeed.

Try to find the URL the reducer is trying to access and open it in a browser to see what error you get.
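To see which hosts the fetches are failing against, you can count the failures in the reduce task log first. This is only a minimal sketch: it assumes the Hadoop 1.x log format shown in the question, where the source host is the last field of each "copy failed" line, and the function name failed_fetch_hosts is made up for illustration.

```shell
#!/bin/sh
# Count "copy failed" events per source host in a reduce task log.
# Assumption: the host is the last whitespace-separated field of the
# line, as in the log excerpt from the question.
failed_fetch_hosts() {
    # With no file arguments, awk reads the log from standard input.
    awk '/copy failed:/ { count[$NF]++ }
         END { for (h in count) print h, count[h] }' "$@"
}
```

Call it with the path to the reduce attempt's log file, or pipe the log into it. Each output line is a host followed by the number of failed copies from that host.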

You can also blacklist the node and exclude it completely from the list of nodes Hadoop uses:

  In conf/mapred-site.xml

  <property>
     <name>mapred.hosts.exclude</name>
     <value>/full/path/of/host/exclude/file</value>
  </property>

  Then make the jobtracker re-read the node lists (run from the Hadoop install directory):

  bin/hadoop mradmin -refreshNodes
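
The exclude file is a plain text file with one host per line. A small sketch of adding a host to it without creating duplicate entries follows. The function name exclude_host is hypothetical, and the entry has to be the name or address under which the jobtracker knows the node, which matters here because the host has two IP addresses.

```shell
#!/bin/sh
# Add a host to the exclude file unless it is already listed.
# Hypothetical helper; the file path must match the value of
# mapred.hosts.exclude in conf/mapred-site.xml.
exclude_host() {
    host=$1
    exclude_file=$2
    touch "$exclude_file"
    # -x matches the whole line, -F treats dots in the IP literally.
    if ! grep -qxF "$host" "$exclude_file"; then
        printf '%s\n' "$host" >> "$exclude_file"
    fi
}

# Example, using the placeholder path from the config above:
#   exclude_host 192.168.1.23 /full/path/of/host/exclude/file
#   bin/hadoop mradmin -refreshNodes
```

Appending the entry has no effect by itself. The jobtracker picks up the change only after the refresh command runs.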