I am getting these messages:
16/05/22 13:33:53 ERROR YarnScheduler: Lost executor 61 on <host>: Executor heartbeat timed out after 134828 ms
16/05/22 13:33:53 WARN TaskSetManager: Lost task 25.0 in stage 12.0 (TID 2214, <host>): ExecutorLostFailure (executor 61 lost)
Will a replacement executor be spawned?
Answer 0 (score: 11)
Will a replacement executor be spawned?
Yes, it will. Spark's DAGScheduler and its lower-level cluster manager implementation (Standalone, YARN, or Mesos) will notice the failed task and take care of rescheduling it as part of the overall stage being executed (see the configuration sketch after the list below).
DAGScheduler does three things in Spark (explained in detail below):
- computes the execution DAG, i.e. the DAG of stages, for a job
- determines the preferred locations to run each task on
- handles failures due to shuffle output files being lost
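None of this requires anything from your code; as a minimal sketch of my own (not from the answer), the settings below are the standard knobs that bound how long Spark keeps retrying before giving up, assuming the job is launched with spark-submit against YARN. The property names are real Spark settings; the values are only illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object RetrySettingsSketch {
  def main(args: Array[String]): Unit = {
    // The master/deploy mode are expected to come from spark-submit (e.g. --master yarn).
    val conf = new SparkConf()
      .setAppName("retry-settings-sketch")
      // A task lost with its executor is re-attempted up to this many times on
      // other executors before the stage (and the job) is failed; the default is 4.
      .set("spark.task.maxFailures", "4")
      // On YARN, the application only aborts after this many executor failures;
      // until then the cluster manager keeps granting replacement executors.
      .set("spark.yarn.max.executor.failures", "10")

    val sc = new SparkContext(conf)
    // ... job code: tasks lost with an executor are rescheduled transparently by the DAGScheduler ...
    sc.stop()
  }
}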
Answer 1 (score: 4)
I was seeing errors like these:
16/02/27 21:37:01 ERROR cluster.YarnScheduler: Lost executor 6 on ip-10-0-0-156.ec2.internal: remote Akka client disassociated
16/02/27 21:37:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-156.ec2.internal:39097] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
16/02/27 21:37:01 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 1.0
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 92), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 88), so marking it as still running
16/02/27 21:37:01 WARN scheduler.TaskSetManager: Lost task 146.0 in stage 1.0 (TID 1151, ip-10-0-0-156.ec2.internal): ExecutorLostFailure (executor 6 lost)
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 93), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 89), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 87), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 90), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 91), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 85), so marking it as still running
16/02/27 21:37:01 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
16/02/27 21:37:02 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(6, ip-10-0-0-156.ec2.internal, 34952)
16/02/27 21:37:02 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/02/27 21:37:02 INFO scheduler.Stage: Stage 1 is now unavailable on executor 6 (536/598, false)
16/02/27 21:37:17 INFO scheduler.TaskSetManager: Starting task 146.1 in stage 1.0 (TID 1152, ip-10-0-0-154.ec2.internal, RACK_LOCAL, 1396 bytes)
16/02/27 21:37:17 WARN scheduler.TaskSetManager: Lost task 123.0 in stage 1.0 (TID 1148, ip-10-0-0-154.ec2.internal): java.io.IOException: Failed to connect to ip-10-0-0-156.ec2.internal/10.0.0.156:34952
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 86), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 94), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
The fix was to increase spark.yarn.executor.memoryOverhead until these errors disappeared. This setting controls the buffer between the JVM heap size and the amount of memory requested from YARN (the JVM can take up memory beyond its heap size). You also want to make sure that in the YARN NodeManager configuration the virtual-memory check (yarn.nodemanager.vmem-check-enabled) is set to false; the errors come from the NodeManager killing the container, and disabling that check stops it from doing so over virtual-memory usage. If the container is genuinely short on physical memory, make sure the JVM heap size fits inside the container.
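As a rough illustration of that fix (my sketch, not part of the original answer), assuming Spark 1.x/2.x on YARN; the 8g and 2048 values are placeholders to tune for your job:

import org.apache.spark.{SparkConf, SparkContext}

object MemoryOverheadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-overhead-sketch")
      // Executor JVM heap.
      .set("spark.executor.memory", "8g")
      // Extra non-heap memory (in MB) added to each executor's YARN container request;
      // keep raising it until the executor-lost errors stop.
      // (Spark 2.3+ spells this spark.executor.memoryOverhead.)
      .set("spark.yarn.executor.memoryOverhead", "2048")

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}

// The NodeManager-side change is cluster configuration, not application code;
// it goes into yarn-site.xml on the cluster nodes, for example:
//   <property>
//     <name>yarn.nodemanager.vmem-check-enabled</name>
//     <value>false</value>
//   </property>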
See the breakdown below for a better picture; the memory requested from YARN has to hold all of the following:
- the JVM heap
- the JVM's permanent generation
- any off-heap allocations
In most cases, an overhead of between 15% and 30% of the JVM heap size is enough. Some jobs need more overhead and some need less, so your job configuration should include the right JVM and container settings.
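As a rough worked example (my numbers, not from the answer): with an 8 GB executor heap, 15%-30% overhead comes to roughly 1.2-2.4 GB, so you would set spark.yarn.executor.memoryOverhead somewhere around 1200-2400 MB, and YARN would be asked for a container of about 9.2-10.4 GB per executor.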