Question

We are running a Spark 2.1.0 standalone cluster on a single node with 8 cores and 50 GB of memory (a single worker).

We run Spark applications in cluster mode with the following memory settings:

--driver-memory = 7GB (default - 1core is used)
--worker-memory = 43GB (all remaining cores - 7 cores)

Recently we have frequently observed executors being killed and restarted by the driver/master. I found the following logs on the driver:

17/12/14 03:29:39 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 3658237 ms exceeds timeout 3600000 ms  
17/12/14 03:29:39 ERROR TaskSchedulerImpl: Lost executor 2 on 10.150.143.81: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 23.0 in stage 316.0 (TID 9449, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 9.0 in stage 318.0 (TID 9459, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 8.0 in stage 318.0 (TID 9458, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 5.0 in stage 318.0 (TID 9455, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 7.0 in stage 318.0 (TID 9457, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms

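The application is not very memory intensive; it has a couple of joins and writes datasets to a directory. The same code runs in spark-shell without any failures.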

I am looking for cluster tuning advice or any configuration settings that would reduce how often executors get killed.

Answer 1

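First, if your instance has exactly 50 GB of RAM, I would advise never allocating the full 50 GB to any application. The rest of the system also needs some RAM to work, and the OS uses RAM not claimed by applications to cache files and reduce disk reads. The JVM itself also has a small memory overhead.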

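If your Spark job uses all the memory, your instance will inevitably swap, and once it swaps it will start to misbehave. You can easily check memory usage and see whether the server is swapping by running htop. You should also make sure swappiness is set to 0, so that the system does not swap unless it really has to.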
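If you prefer a scripted check over watching htop, the same information can be read from the Linux kernel interfaces. The following is only a minimal sketch, assuming a Linux host: the function names `swap_usage_kb` and `host_swap_report` are made up for illustration, and the parser only relies on the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`.

```python
def swap_usage_kb(meminfo_text):
    """Return (swap_total_kb, swap_used_kb) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


def host_swap_report():
    """Read the live values on a Linux host (hypothetical helper, not called here)."""
    with open("/proc/meminfo") as f:
        total_kb, used_kb = swap_usage_kb(f.read())
    with open("/proc/sys/vm/swappiness") as f:
        swappiness = int(f.read().strip())
    return {"swap_total_kb": total_kb, "swap_used_kb": used_kb,
            "swappiness": swappiness}
```

Calling `host_swap_report()` on the worker node while the job runs should show whether `swap_used_kb` grows over time, which would support the swapping theory.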

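That is all I can say given the information you provided. If this does not help, consider providing more information, such as the full, exact parameters of your Spark job.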

Answer 2

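The executor may be running into a memory problem, so you should configure the cores together with the executor memory in the spark-env.sh file, which can be found at ~/spark/conf/spark-env.sh. Given that your total memory is 50 GB: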

export SPARK_WORKER_CORES=8
export SPARK_WORKER_INSTANCES=5
export SPARK_WORKER_MEMORY=8G
export SPARK_EXECUTOR_INSTANCES=2

If your data is not too big to process, you can set the driver memory in spark-defaults.conf. Also give the executor some overhead memory in the same file, ~/spark/conf/spark-defaults.conf:

spark.executor.memoryOverhead 1G
spark.driver.memory  1G
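As a sanity check on settings like these, the arithmetic can be written down explicitly. This is only a rough sketch under stated assumptions: the function names are hypothetical, the 50 GB and 8 cores are the figures from the question, and `SPARK_WORKER_MEMORY` and `SPARK_WORKER_CORES` are treated as per-worker values that get multiplied by `SPARK_WORKER_INSTANCES`. JVM overhead outside the configured memory is not modelled.

```python
def parse_gib(value):
    """Convert a Spark-style size such as '8G' or '512M' to GiB."""
    units = {"M": 1 / 1024, "G": 1.0}
    number, unit = value[:-1], value[-1].upper()
    return float(number) * units[unit]


def standalone_budget(total_gib, host_cores, worker_instances,
                      worker_cores, worker_memory):
    """Summarise what the workers on one host offer to Spark in total."""
    committed = worker_instances * parse_gib(worker_memory)
    advertised = worker_instances * worker_cores
    return {
        "committed_gib": committed,               # memory Spark may hand out
        "headroom_gib": total_gib - committed,    # left for OS and page cache
        "advertised_cores": advertised,
        "cores_oversubscribed": advertised > host_cores,
    }


# Values from the spark-env.sh fragment above, on the 50 GB / 8-core host.
budget = standalone_budget(50, 8, worker_instances=5,
                           worker_cores=8, worker_memory="8G")
print(budget)
```

With these inputs the workers commit 40 GiB and leave 10 GiB of headroom, but five workers each offering 8 cores advertise 40 cores on an 8-core machine, so it may be worth lowering `SPARK_WORKER_CORES` when running several worker instances.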

Answer 3

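With spark-shell, the driver is also the executor. It looks like the driver killed the executor because it had not received a heartbeat for 1 hour. Normally, the heartbeat is configured at 10 seconds.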

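Did you modify the default heartbeat settings?
Check GC on the executor. Long GC pauses are a common cause of missed heartbeats. If that is the case, increase the memory available per core on the executor, which usually means adding memory or reducing cores.
Is there anything in your network that could cause heartbeats to be dropped?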

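The logs clearly show that the driver killed the executor because it had not received any heartbeat for 1 hour, and that the executor was running some tasks when it was killed.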
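To make the "1 hour" reading concrete, the figures can be pulled straight out of the HeartbeatReceiver warning. The snippet below is a small sketch; the regex and the function name `parse_heartbeat_timeout` are illustrative and only match the message format shown in the question. For comparison, to my knowledge the stock defaults are 10s for spark.executor.heartbeatInterval and 120s for spark.network.timeout, so a 3600000 ms timeout suggests the setting was raised somewhere.

```python
import re

# Matches the HeartbeatReceiver warning quoted in the question.
HEARTBEAT_RE = re.compile(
    r"Removing executor (\S+) with no recent heartbeats: "
    r"(\d+) ms exceeds timeout (\d+) ms"
)


def parse_heartbeat_timeout(line):
    """Return executor id plus elapsed and timeout in minutes, or None."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    elapsed_ms = int(match.group(2))
    timeout_ms = int(match.group(3))
    return {
        "executor": match.group(1),
        "elapsed_min": elapsed_ms / 60000,
        "timeout_min": timeout_ms / 60000,
    }


log_line = ("17/12/14 03:29:39 WARN HeartbeatReceiver: Removing executor 2 "
            "with no recent heartbeats: 3658237 ms exceeds timeout 3600000 ms")
info = parse_heartbeat_timeout(log_line)
print(info)
```

For the line above, the timeout works out to 60 minutes and the executor had been silent for roughly 61 minutes, which is far longer than a typical GC pause and points at a stalled or swapping JVM rather than a brief hiccup.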

Spark standalone cluster tuning
