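I am using Spark 1.2.1 on three nodes. Each node runs a worker (three workers in total, set up through the slaves configuration), and I run a daily job with the following commands: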
./spark-1.2.1/sbin/start-all.sh
//crontab configuration:
./spark-1.2.1/bin/spark-submit --master spark://11.11.11.11:7077 --driver-class-path home/ubuntu/spark-cassandra-connector-java-assembly-1.2.1-FAT.jar --class "$class" "$jar"
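I want the Spark master and the slave workers to stay available at all times. If one of them fails, it should be restarted automatically, like a service (the way Cassandra is).
Is there any way to do this?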
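EDIT:
I had a look at the start-all.sh script. It only contains the setup for the start-master.sh and start-slaves.sh scripts. I tried to create a supervisor configuration file for it, but only got the following errors: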
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.12: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.12: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
Answer 0 (score: 1)
Tools such as monit and supervisor (or even systemd) can monitor failed processes and restart them.
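As a rough sketch of the supervisor route: start-all.sh starts the daemons in the background and uses ssh to reach the slaves, which does not fit supervisor, since supervisor expects each program it manages to stay in the foreground. One option is to run supervisor on every node and let it launch the master or worker class directly through bin/spark-class. The install path /home/ubuntu/spark-1.2.1, the user ubuntu, the web UI port 8080, and the config file location are assumptions made for illustration; only the master address spark://11.11.11.11:7077 comes from the question.

```ini
; Assumed location: /etc/supervisor/conf.d/spark.conf (adjust to your setup).
; Assumed Spark install path: /home/ubuntu/spark-1.2.1 (hypothetical).

; Put this section only on the master node (11.11.11.11).
[program:spark-master]
; Runs the master in the foreground instead of going through start-master.sh.
command=/home/ubuntu/spark-1.2.1/bin/spark-class org.apache.spark.deploy.master.Master --ip 11.11.11.11 --port 7077 --webui-port 8080
; Assumed user; use whichever account owns the Spark install.
user=ubuntu
autostart=true
; Restart the process whenever it exits.
autorestart=true
; Stop and kill the whole process group so no child JVM is left behind.
stopasgroup=true
killasgroup=true

; Put this section on every node that should run a worker.
[program:spark-worker]
; Runs one worker in the foreground and registers it with the master.
command=/home/ubuntu/spark-1.2.1/bin/spark-class org.apache.spark.deploy.worker.Worker spark://11.11.11.11:7077
user=ubuntu
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
```

Before handing the processes over to supervisor, stop the ones started by start-all.sh (for example with sbin/stop-all.sh). Otherwise the old and new daemons will compete for the same ports, much like the "running as process 14627. Stop it first." message in the question.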