Understanding Spark: Cluster Manager, Master and Driver nodes

Date: 2016-01-11 13:10:23

Tags: hadoop apache-spark yarn failover apache-spark-standalone

After reading this question, I would like to ask some follow-up questions:

  1. The Cluster Manager is a long-running service; on which node does it run?
  2. Is it possible for the Master and the Driver node to be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?
  3. In case the Driver node fails, who is responsible for re-launching the application, and what will happen exactly? i.e., how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
  4. Similar to the previous question: in case the Master node fails, what will happen exactly, and who is responsible for recovering from the failure?

2 answers:

Answer 0 (score: 13)

1. The Cluster Manager is a long-running service, on which node it is running?

The Cluster Manager is the Master process in Spark standalone mode. It can be started anywhere with ./sbin/start-master.sh; in YARN, it would be the ResourceManager.

2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?

The Master is per cluster, and the Driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes.

  1. In client mode, the driver is launched in the same process as the client that submits the application.
  2. In cluster mode, however, the driver is launched from one of the Workers (standalone) or inside the ApplicationMaster node (YARN), and the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish.

If an application is submitted with --deploy-mode client on the Master node, both the Master and the Driver will be on the same node. See the deployment of a Spark application over YARN.
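As a sketch of the two deploy modes with spark-submit (the host, class and jar names below are placeholders, not from the original post):

```shell
# Client mode: the driver runs inside the spark-submit process itself
./bin/spark-submit --master spark://master-host:7077 --deploy-mode client \
  --class com.example.MyApp myapp.jar

# Cluster mode: the driver is launched on a Worker (standalone) or in the
# YARN ApplicationMaster, and spark-submit exits after handing off the app
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar
```

These are CLI fragments and require a working Spark installation to run.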

3. In case where the Driver node fails, who is responsible for re-launching the application? and what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?

If the driver fails, all executor tasks will be killed for that submitted/triggered Spark application.

4. In case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Master node failures are handled in two ways.

  1. Standby Masters with ZooKeeper:

    Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. See the Spark standalone documentation for the configurations.

  2. Single-Node Recovery with Local File System:

    ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. See the Spark standalone documentation for the configuration and more details.
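Both recovery modes are enabled through `spark.deploy.recoveryMode` properties passed to the Master daemon; a minimal sketch of each (the ZooKeeper hosts and recovery directory below are placeholders):

```shell
# conf/spark-env.sh on each Master candidate - ZooKeeper-based HA
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Or, for single-node FILESYSTEM recovery on a lone Master:
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/var/lib/spark/recovery"
```

This is a configuration fragment; restart the Master process after changing it.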

Answer 1 (score: 5)

The Cluster Manager is a long-running service; on which node does it run?

The Cluster Manager is merely a manager of resources (i.e., CPU and RAM) that SchedulerBackends use to launch tasks. Beyond providing resources, the Cluster Manager does nothing else for Apache Spark; once Spark executors are launched, they communicate directly with the driver to run tasks.

You can start a standalone master server by executing:

./sbin/start-master.sh

It can be started anywhere.
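For example (the host below is a placeholder; older Spark releases name the second script start-slave.sh):

```shell
# Start a standalone Master; its log prints the spark://host:port URL
./sbin/start-master.sh

# On each worker machine, register a Worker with that Master
./sbin/start-worker.sh spark://master-host:7077
```

These commands require a Spark distribution unpacked on each machine.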

To run an application on the Spark cluster:

./bin/spark-shell --master spark://IP:PORT

Is it possible for the Master and the Driver node to be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?

In standalone mode, when you start your machines, certain JVMs are launched: your Spark Master starts, and a Worker JVM starts on each machine and registers with the Spark Master. Both of these are resource managers. When you start or submit your application in cluster mode, the driver is launched wherever you started the application. The driver JVM contacts the Spark Master for executors (Ex), and in standalone mode, the Worker starts the executors. So the Spark Master is per cluster, and the driver JVM is per application.

In case the Driver node fails, who is responsible for re-launching the application, and what will happen exactly? i.e., how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?

If an executor (Ex) JVM crashes, the Worker JVM will restart it; if a Worker JVM fails, the Spark Master will restart it. With a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code. The Spark Master starts the driver JVM.
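A sketch of submitting with supervision (the host, class and jar names are placeholders):

```shell
# Cluster deploy mode with --supervise: the standalone Master restarts
# the driver if it exits with a non-zero code
./bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster \
  --supervise --class com.example.MyApp myapp.jar
```

This is a CLI fragment and requires a running standalone cluster.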

Similar to the previous question: in case the Master node fails, what will happen exactly, and who is responsible for recovering from the failure?

A Master failure will leave the executors unable to communicate with it, so they will stop working. A Master failure will also leave the driver unable to contact it for job status, so your application will fail. The loss of the Master will be acknowledged by running applications, which should otherwise continue to work more or less as before, with two important exceptions:

1. The application will not be able to finish in a graceful way.

2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, the Worker will simply give up.

reregisterWithMaster() - re-registers with the active Master this Worker has been communicating with. If there is none, it means this Worker is still bootstrapping and has not yet established a connection with a Master, in which case we should re-register with all Masters. It is important that the Worker re-register only with the active Master during failures; if the Worker unconditionally attempts to re-register with all Masters, a race condition may occur. The error is detailed in SPARK-4592:

At this point, long-running applications cannot continue processing, but this still does not result in immediate failure. Instead, the application will wait for the Master to come back online (filesystem recovery) or for contact from a new leader (ZooKeeper mode), and if that happens, it will continue processing.