What happens when an executor is lost?

Time: 2016-05-22 17:39:13

Tags: apache-spark

I am getting these messages:

16/05/22 13:33:53 ERROR YarnScheduler: Lost executor 61 on <host>: Executor heartbeat timed out after 134828 ms
16/05/22 13:33:53 WARN TaskSetManager: Lost task 25.0 in stage 12.0 (TID 2214, <host>): ExecutorLostFailure (executor 61 lost)

Will a replacement executor be spawned?

2 answers:

Answer 0 (score: 11):

  Will a replacement executor be spawned?

Yes, it will. Spark's DAGScheduler and its lower-level cluster manager implementation (Standalone, YARN, or Mesos) will notice the failed tasks and take care of rescheduling them as part of the overall stage being executed.
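As a minimal sketch of the knob that governs this resubmission (spark.task.maxFailures is a real Spark setting; the app name and value here are illustrative, and the master would normally come from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// spark.task.maxFailures caps how many times a task (e.g. one lost together
// with its executor) is retried before the stage, and hence the job, fails.
val conf = new SparkConf()
  .setAppName("executor-loss-demo")
  .set("spark.task.maxFailures", "4") // 4 is the default
val sc = new SparkContext(conf)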

DAGScheduler

The DAGScheduler does three things in Spark (a concrete example follows below):
  • Computes an execution DAG for a job, i.e. a DAG of stages.

  • Determines the preferred locations on which to run each task.

  • Handles failures due to lost shuffle output files.
For more details, see the Advanced Spark Tutorial and Mastering Apache Spark.
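To make "a DAG of stages" concrete, here is a minimal sketch, assuming a live SparkContext `sc` (e.g. in spark-shell) and a hypothetical input path. The reduceByKey introduces a shuffle, so the DAGScheduler splits the job into two stages; losing an executor that holds the first stage's shuffle output is what produces the "Resubmitted ShuffleMapTask" lines in the logs of the answer below.

// reduceByKey forces a shuffle, splitting the job into two stages
val counts = sc.textFile("hdfs:///tmp/words.txt")  // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle boundary between the two stages
counts.count()          // the action that triggers the whole DAG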

Answer 1 (score: 4):

Yes. It will re-queue the tasks from the lost executor and resubmit them so they are replayed. See the logs below:

16/02/27 21:37:01 ERROR cluster.YarnScheduler: Lost executor 6 on ip-10-0-0-156.ec2.internal: remote Akka client disassociated
16/02/27 21:37:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-156.ec2.internal:39097] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
16/02/27 21:37:01 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 1.0
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 92), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 88), so marking it as still running
16/02/27 21:37:01 WARN scheduler.TaskSetManager: Lost task 146.0 in stage 1.0 (TID 1151, ip-10-0-0-156.ec2.internal): ExecutorLostFailure (executor 6 lost)
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 93), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 89), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 87), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 90), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 91), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 85), so marking it as still running
16/02/27 21:37:01 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
16/02/27 21:37:02 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(6, ip-10-0-0-156.ec2.internal, 34952)
16/02/27 21:37:02 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/02/27 21:37:02 INFO scheduler.Stage: Stage 1 is now unavailable on executor 6 (536/598, false)
16/02/27 21:37:17 INFO scheduler.TaskSetManager: Starting task 146.1 in stage 1.0 (TID 1152, ip-10-0-0-154.ec2.internal, RACK_LOCAL, 1396 bytes)
16/02/27 21:37:17 WARN scheduler.TaskSetManager: Lost task 123.0 in stage 1.0 (TID 1148, ip-10-0-0-154.ec2.internal): java.io.IOException: Failed to connect to ip-10-0-0-156.ec2.internal/10.0.0.156:34952
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 86), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 94), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)

The fix is to increase spark.yarn.executor.memoryOverhead until the error goes away. This controls the buffer between the JVM heap size and the amount of memory requested from YARN (the JVM can take up memory beyond its heap size). You also want to make sure that, in the YARN NodeManager configuration, yarn.nodemanager.vmem-check-enabled is set to false; otherwise the NodeManager will kill containers it believes have exceeded their memory limit. If the container is running out of physical memory, make sure the container is large enough to hold the JVM heap.
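A hedged sketch of applying the Spark side of this fix (spark.executor.memory and spark.yarn.executor.memoryOverhead are real Spark 1.x configuration keys; the app name and sizes are illustrative assumptions, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-overhead-demo")                 // hypothetical app name
  .set("spark.executor.memory", "4g")                 // executor JVM heap (illustrative size)
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB requested from YARN beyond the heap
val sc = new SparkContext(conf)

Note that the virtual-memory check (yarn.nodemanager.vmem-check-enabled) is a YARN NodeManager setting in yarn-site.xml, not a Spark configuration key.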

See the diagram below for a better picture of this.

[image: diagram of the YARN container memory layout]

The container size should be large enough to hold:

  • The JVM heap

  • The JVM's permanent generation

  • Any off-heap allocations

In most cases, an overhead of 15%-30% of the JVM heap is sufficient; some jobs need more, others less. Your job configuration should include the correct JVM and container settings.
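For instance, here is a small worked sizing calculation under assumed numbers (a 4 GB heap with 25% overhead); the 384 MB floor mirrors the documented minimum for spark.yarn.executor.memoryOverhead:

// Rough sizing arithmetic under assumed numbers: 4 GB heap, 25% overhead.
val heapMb      = 4096
val overheadMb  = math.max((heapMb * 0.25).toInt, 384) // 1024 MB here; 384 MB is the documented floor
val containerMb = heapMb + overheadMb                  // about 5120 MB to request from YARN per executor
println(s"request about $containerMb MB per container")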