Question

When I run ./stop-yarn.sh and then ./start-yarn.sh, all in-progress jobs print the following:

14/10/22 16:23:28 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:29 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:30 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:31 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:32 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:33 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:34 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:35 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:36 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:37 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

Or:

14/10/22 16:28:19 ERROR security.UserGroupInformation: PriviledgedActionException as:supertool (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1413966215954_0002' doesn't exist in RM.

Is there a way to restart YARN without affecting all in-progress jobs? Many thanks!

Answer 1

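You need to configure a highly available ResourceManager. Read Deploy ResourceManager HA Cluster to learn how to configure such a cluster. You will then be able to fail over the RM either manually or automatically.

This link explains more: ResourceManager High Availability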
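
As a rough illustration of what such a setup involves, the fragment below sketches yarn-site.xml properties for an active/standby ResourceManager pair. The property names follow the Hadoop 2.x ResourceManager HA documentation; the cluster id, the rm1/rm2 ids, and every host name are placeholders (assumptions, not values from the question), so verify all of them against the documentation for your Hadoop version.

```xml
<configuration>
  <!-- Turn on ResourceManager HA -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- Placeholder cluster id; any name that identifies this cluster -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>example-cluster</value>
  </property>
  <!-- Logical ids for the two RMs (names are arbitrary) -->
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <!-- Placeholder host names for each RM -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <!-- Placeholder ZooKeeper quorum, used for leader election and state storage -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>
```

With a pair like this, one RM can be taken down for maintenance while the other is active, which is what makes a restart possible without a cluster-wide outage.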

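Actually, since 2.4.0 it is possible to restart the RM and preserve the accepted applications (MR jobs) without a secondary HA RM. See ResourceManager Restart:

The ResourceManager is the central authority that manages resources and schedules applications running on YARN. Hence, it is potentially a single point of failure in an Apache YARN cluster.

This document gives an overview of ResourceManager Restart, a feature that enhances the ResourceManager to keep functioning across restarts and also makes ResourceManager downtime invisible to end users.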

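The ResourceManager Restart feature is divided into two phases:

Phase 1: Enhance the RM to persist application/attempt state and other credentials information in a pluggable state store. The RM will reload this information from the state store upon restart and re-kick the previously running applications. Users are not required to resubmit the applications.

Phase 2: Focus on reconstructing the running state of the ResourceManager by combining the container statuses from NodeManagers and the container requests from ApplicationMasters upon restart. The key difference from Phase 1 is that previously running applications will not be killed after the RM restarts, so applications won't lose their work because of an RM outage.

As of the Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented; the linked document describes it in detail.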
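
For reference, a minimal sketch of the yarn-site.xml settings that enable RM recovery with a file-system-backed state store might look like the following. The property names and the store class are taken from the Hadoop 2.x ResourceManager Restart documentation; the HDFS URI is a placeholder (an assumption), and the available store implementations vary by release, so check the documentation for your version.

```xml
<configuration>
  <!-- Make the RM persist application state and reload it on restart -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- Pluggable state store; this one writes to a Hadoop-compatible file system -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>
  <!-- Placeholder location for the stored RM state -->
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>hdfs://namenode.example.com:8020/rmstore</value>
  </property>
</configuration>
```

Recovery has to be enabled before the applications are submitted, because only state that was written to the store can be reloaded. Under Phase 1 the recovered applications are re-kicked rather than left running, so work already in progress may be repeated.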

Answer 2

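Short answer: no!

When you stop the resource manager, the tasktrackers and datanodes cannot communicate with the master node, and therefore cannot communicate with each other either (because they don't know where to ask for their input data). The nodes also don't know where the data is stored. All of this information is kept on the master. A running job needs all of it to continue, so if you stop the resource manager, the running jobs will fail.

Edit: Apparently things are not that simple, as @Remus Rusanu's answer shows :)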

Hadoop: how to restart YARN without disturbing all in-progress jobs?