Question

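I am trying to build a Docker image containing Apache Spark. It is built on top of the official openjdk-8-jre image.

The goal is to run Spark in cluster mode, so with at least one master (started via sbin/start-master.sh) and one or more slaves (sbin/start-slave.sh). See spark-standalone-docker for my Dockerfile and entrypoint script.

The build itself succeeds. The problem is that when I run the container, it starts and then stops shortly afterwards. The cause is that the Spark master start script launches the master in daemon mode and then exits. The container therefore terminates, because no process is left running in the foreground.

The obvious solution is to run the Spark master process in the foreground, but I could not figure out how (Google turned up nothing either). My "workaround" is to run tail -f on the Spark log directory.

Hence my questions:

1. How do I run the Apache Spark master in the foreground?
2. If the first is not possible/feasible/whatever, what is the preferred (i.e. best-practice) way to keep a container "alive"? (I really don't want to use an infinite loop with a sleep command.)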

Answer 1

How do I run the Apache Spark master in the foreground?

You can use spark-class with the Master class:

bin/spark-class org.apache.spark.deploy.master.Master

And likewise for a worker:

bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL
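
As a rough sketch of how a single entrypoint script could pick between these two commands: the helper below only prints the command line for a given role. The function name spark_foreground_cmd and the /opt/spark fallback for SPARK_HOME are assumptions made for illustration, not something from the original Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper: print the foreground command for a given role.
# The /opt/spark default is an assumption about the image layout.
spark_foreground_cmd() {
    role="$1"
    master_url="$2"
    spark_home="${SPARK_HOME:-/opt/spark}"
    case "$role" in
        master)
            echo "$spark_home/bin/spark-class org.apache.spark.deploy.master.Master"
            ;;
        worker)
            # A worker has to be told where the master is.
            if [ -z "$master_url" ]; then
                echo "worker role needs a master URL" >&2
                return 1
            fi
            echo "$spark_home/bin/spark-class org.apache.spark.deploy.worker.Worker $master_url"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}
```

In an entrypoint this would typically be used as `exec $(spark_foreground_cmd master)`, so that the shell is replaced by the Spark process and the container stays up for as long as that process runs.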

If you are looking for a production-ready solution, you should consider using a proper supervisor such as dumb-init or tini.

Answer 2

Updated answer (for Spark 2.4.0):

To start the Spark master in the foreground, just set the environment variable SPARK_NO_DAEMONIZE=true before running ./start-master.sh, and you are good to go.

For more information, see $SPARK_HOME/sbin/spark-daemon.sh:

# Runs a Spark command as a daemon.
#
# Environment Variables
#
#   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_HOME}/conf.
#   SPARK_LOG_DIR   Where log files are stored. ${SPARK_HOME}/logs by default.
#   SPARK_MASTER    host:path where spark code should be rsync'd from
#   SPARK_PID_DIR   The pid files are stored. /tmp by default.
#   SPARK_IDENT_STRING   A string representing this instance of spark. $USER by default
#   SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
#   SPARK_NO_DAEMONIZE   If set, will run the proposed command in the foreground. It will not output a PID file.
##
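
A minimal sketch of an entrypoint built on this variable might look as follows. The function name run_master_foreground and the /opt/spark fallback are assumptions for illustration; only SPARK_NO_DAEMONIZE and sbin/start-master.sh come from the answer above.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: run the stock start script in the foreground.
# The /opt/spark default is an assumption about the image layout.
run_master_foreground() {
    # Tell spark-daemon.sh not to fork into the background.
    export SPARK_NO_DAEMONIZE=true
    # exec replaces the shell with the start script, so nothing is left
    # behind that could exit early and stop the container.
    # Any extra arguments are passed through unchanged.
    exec "${SPARK_HOME:-/opt/spark}/sbin/start-master.sh" "$@"
}
```

Calling run_master_foreground as the last line of the entrypoint script would then keep the container alive for as long as the master runs, without tail -f or a sleep loop.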

Docker container with Apache Spark in standalone cluster mode
