Dataproc cluster Spark job submission fails to start NodeManager

Date: 2019-09-03 23:21:11

Tags: apache-spark google-cloud-platform google-cloud-dataproc

We have a Dataproc cluster configured with 4 workers. The cluster is up and running, but whenever we try to submit a Spark job we get this error:

YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager

Some of the messages shown in the Stackdriver logs are:

Daemon YARN_NODE_MANAGER failed to restart

Update: We also see this issue when we add new worker nodes to the existing Dataproc cluster.

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from <MasterNode DNS> , Sending SHUTDOWN signal to the NodeManager.
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:845)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:912)

1 answer:

Answer 0 (score: 1)

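This error looks like a YARN NodeManager decommissioning issue. Could you check whether the following YARN include/exclude node configuration files on the Dataproc master GCE VM contain any errors: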

  • /etc/hadoop/conf/nodes_exclude
  • /etc/hadoop/conf/nodes_include
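
As a rough sketch of that check, the snippet below reports whether a given worker hostname appears in either file. The `WORKER` value is a hypothetical placeholder, and the exact-line match assumes each file lists one hostname per line in the same form the NodeManager registers with; adjust it if your files use short names or a different format.

```shell
# Return 0 if the hostname ($2) appears as an exact line in the file ($1),
# non-zero otherwise (including when the file does not exist).
host_in_list() {
  [ -f "$1" ] && grep -Fxq -- "$2" "$1"
}

# Hypothetical worker hostname; replace with one of your own worker nodes.
WORKER="${WORKER:-my-cluster-w-0.c.my-project.internal}"

# Paths are the ones named in the answer; run this on the Dataproc master VM.
for f in /etc/hadoop/conf/nodes_exclude /etc/hadoop/conf/nodes_include; do
  if host_in_list "$f" "$WORKER"; then
    echo "$WORKER is listed in $f"
  else
    echo "$WORKER is not listed in $f"
  fi
done
```

A worker that shows up in nodes_exclude, or that is missing from a non-empty nodes_include, would be a plausible candidate for the "Disallowed NodeManager" message.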

After changing these configuration files, run the refresh-nodes command:

yarn rmadmin -refreshNodes 

You should then see the NodeManager rejoin YARN.
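
One possible way to confirm the result is to refresh the node lists and then ask the ResourceManager which nodes it reports in each state. This is only a sketch: `YARN_BIN` is a hypothetical variable introduced here so the commands can be dry-run, and the script assumes it is executed on the master node where the `yarn` CLI is on the PATH.

```shell
# YARN_BIN is a hypothetical override (not part of YARN); it defaults to the real CLI.
YARN_BIN="${YARN_BIN:-yarn}"

# List the nodes that the ResourceManager reports in the given state(s).
list_nodes_in_state() {
  "$YARN_BIN" node -list -states "$1"
}

if command -v "$YARN_BIN" >/dev/null 2>&1; then
  # Re-read the include/exclude files, then show which nodes are active or excluded.
  "$YARN_BIN" rmadmin -refreshNodes
  list_nodes_in_state RUNNING
  list_nodes_in_state DECOMMISSIONED
else
  echo "yarn CLI not found; run this on the Dataproc master node" >&2
fi
```

If the workers appear under RUNNING after the refresh, they have registered with the ResourceManager again.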

For details, see: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html#nodeslistmanager-detects-and-handles-include-and-exclude-list-changes