Question

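I wrote a MapReduce job to extract some information from a dataset. The dataset contains users' ratings of movies. There are about 250K users and about 300K movies. The map output is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I process these pairs.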

When I run the job, the mappers complete as expected, but the reducers always complain:

Task attempt_* failed to report status for 600 seconds.

I know this happens because the task fails to update its status, so I added a call to context.progress() in my code, like this:

int count = 0;
while (values.hasNext()) {
  if (count++ % 100 == 0) {
    context.progress();
  }
  /*other code here*/
}

Unfortunately, this did not help. Many reduce tasks still fail.

Here is the log:

Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!

BTW, the error occurs in the reduce > copy phase. The log says:

reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385

Thanks for your help.

Answer 1

The easiest way is to set this configuration parameter in mapred-site.xml:

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes -->
</property>

Answer 2

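Another simple way is to set it in the job configuration in your program:

 Configuration conf = new Configuration();
 long milliSeconds = 1000 * 60 * 60; // default is 600000; any other value can be given the same way
 conf.setLong("mapred.task.timeout", milliSeconds);

Before setting it, check the job file (job.xml) in the JobTracker GUI for the correct property name, that is, whether it is mapred.task.timeout or mapreduce.task.timeout. When the job runs, check the job file again to confirm that the property changed to the value you set.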
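If you are unsure which of the two names your cluster reads, one option is to write the value under both. The sketch below is only an illustration: the class TaskTimeouts and its apply method are hypothetical names, and the setter parameter is a stand-in for Configuration.setLong so that the snippet compiles without Hadoop on the classpath. It assumes that a release simply ignores the name it does not recognize.

```java
import java.time.Duration;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Hadoop.
final class TaskTimeouts {
    // Property name mentioned for older releases.
    static final String OLD_KEY = "mapred.task.timeout";
    // Property name mentioned for newer releases.
    static final String NEW_KEY = "mapreduce.task.timeout";

    private TaskTimeouts() {}

    /**
     * Writes the timeout, in milliseconds, under both property names
     * and returns the value that was written.
     *
     * @param setter  stand-in for Configuration.setLong(String, long)
     * @param timeout desired task timeout; must not be negative
     */
    static long apply(BiConsumer<String, Long> setter, Duration timeout) {
        if (timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        long millis = timeout.toMillis();
        setter.accept(OLD_KEY, millis);
        setter.accept(NEW_KEY, millis);
        return millis;
    }
}
```

In a real driver this would be called as TaskTimeouts.apply(conf::setLong, Duration.ofMinutes(60)) on the Configuration that is later used to create the job. Checking job.xml afterwards, as described above, is still the way to confirm which name took effect.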

Answer 3

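In newer versions, the parameter name has changed to mapreduce.task.timeout, as described in this link (search for task.timeout). As the same page states, you can also disable the timeout entirely:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

Here is an example setting in mapred-site.xml: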

<property>
  <name>mapreduce.task.timeout</name>
  <value>0</value> <!-- A value of 0 disables the timeout -->
</property>
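To make the quoted rule concrete, here is a toy model of it in Java. It is not Hadoop's code: the class TaskTimeoutRule and the method shouldKill are hypothetical, and "activity" stands for any of the three events in the description (reading input, writing output, or updating the status string). Whether the real check uses a strict comparison is an assumption of this sketch.

```java
// Hypothetical model of the timeout rule; not part of Hadoop.
final class TaskTimeoutRule {
    private TaskTimeoutRule() {}

    /**
     * Returns true if a task whose last activity happened at
     * lastActivityMillis should be considered timed out at nowMillis.
     *
     * @param lastActivityMillis time of the last read, write, or status update
     * @param nowMillis          current time on the same clock
     * @param timeoutMillis      configured timeout; 0 disables the check
     */
    static boolean shouldKill(long lastActivityMillis, long nowMillis, long timeoutMillis) {
        if (timeoutMillis == 0) {
            return false; // a value of 0 disables the timeout
        }
        // Assumption: the task is killed only once the idle time exceeds the timeout.
        return nowMillis - lastActivityMillis > timeoutMillis;
    }
}
```

With the default of 600000 ms, a task that has been idle for 600001 ms would be killed under this model, while the same task with the timeout set to 0 would never be. Note that disabling the timeout also means a task that is genuinely hung will not be killed.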

Answer 4

If you have a Hive query that is timing out, you can set the above configurations this way:

set mapred.tasktracker.expiry.interval = 1800000;

set mapred.task.timeout = 1800000;

Answer 5

From https://issues.apache.org/jira/browse/HADOOP-1763

The cause may be:

1. Tasktrackers run the maps successfully
2. Map outputs are served by jetty servers on the TTs.
3. All the reduce tasks connects to all the TT where maps are run. 
4. since there are lots of reduces wanting to connect the map output server, the jetty servers run out of threads (default 40)
5. tasktrackers continue to make periodic heartbeats to JT, so that they are not dead, but their jetty servers are (temporarily) down.

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."