Google Dataproc - frequently losing connection to executors

Time: 2016-01-20 10:13:06

Tags: apache-spark google-cloud-dataproc

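I am using Dataproc to run Spark commands on a cluster through spark-shell. I frequently get error/warning messages indicating that I have lost the connection to my executors. The messages look like this: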

[Stage 6:>                                                          (0 + 2) / 2]16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 17, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.2 in stage 6.0 (TID 16, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)

...

Here is another sample:

20 10:51:43 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on spark-cluster-femibyte-w-1.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:51:43 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-1.c.gcebook-1039.internal:58745] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 5, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.ExecutorAllocationManager:  Attempted to mark unknown executor 2 idle

Is this normal? Is there anything I can do to prevent it?

1 answer:

Answer 0 (score: 4)

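If the job itself isn't failing, and you aren't seeing other propagated errors tied to actual task failures (at least as far as I can tell from what is posted in the question), you are most likely just seeing a harmless but known to be spammy issue in core Spark. The key point is that Spark dynamic allocation relinquishes underused executors during a job and re-allocates them as needed. The "executor lost" message was originally not suppressed for that case, but we have tested to make sure it has no ill effects on the actual job.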

Here is a googlegroups thread that highlights some of the behavioral details of Spark on YARN.
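One rough way to tell this kind of noise apart from a real failure is to compare how often executors were reported lost with how often a job was actually aborted. The following is only a sketch: `summarize_executor_loss` is a hypothetical helper name, it assumes the driver output was saved to a file, and it assumes a failed job shows up with the usual "Job aborted due to stage failure" text.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or Dataproc): summarize a saved
# driver log. Usage: summarize_executor_loss driver.log
summarize_executor_loss() {
  log_file="$1"
  # Lines where the scheduler reported an executor as lost.
  lost=$(grep -c 'Lost executor' "$log_file" || true)
  # Lines where a job was aborted after its tasks ran out of retries
  # (assumed message text).
  aborted=$(grep -c 'Job aborted due to stage failure' "$log_file" || true)
  printf 'lost_executors=%s aborted_jobs=%s\n' "$lost" "$aborted"
}
```

A nonzero `lost_executors` count together with `aborted_jobs=0` is consistent with the harmless case described above, though it does not prove it.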

To check whether dynamic allocation is indeed what is causing the messages, try running:

spark-shell --conf spark.dynamicAllocation.enabled=false \
    --conf spark.executor.instances=99999

Or, if you are submitting jobs through gcloud beta dataproc jobs:

gcloud beta dataproc jobs submit spark \
    --properties spark.dynamicAllocation.enabled=false,spark.executor.instances=99999
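The two commands above carry the same settings in different shapes: spark-shell takes one `--conf` flag per setting, while gcloud takes a single comma-separated `--properties` value. As a small sketch, the hypothetical helpers below (`as_conf_flags` and `as_properties` are illustrative names, not part of either tool) render one list of `key=value` settings in both forms so the two stay in sync.

```shell
#!/bin/sh
# Hypothetical helpers: render key=value Spark settings in both flag styles.

# Print "--conf k=v --conf k=v ..." for spark-shell.
as_conf_flags() {
  out=""
  for kv in "$@"; do
    # Prefix with the accumulated flags and a space, except on the first item.
    out="${out:+$out }--conf $kv"
  done
  printf '%s\n' "$out"
}

# Print "k=v,k=v,..." for gcloud's --properties flag.
as_properties() {
  # Join arguments with commas; the subshell keeps the IFS change local.
  (IFS=,; printf '%s\n' "$*")
}
```

For example, `as_properties spark.dynamicAllocation.enabled=false spark.executor.instances=99999` yields the value passed to `--properties` above.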

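If you really are seeing network hiccups or other Dataproc errors disassociating the master and workers, as opposed to something like an application-side OOM, you can email the Dataproc team directly at dataproc-feedback@google.com. Beta will not be an excuse for latently broken behavior (though of course we hope to weed out tricky edge-case bugs that we may not have discovered yet during the beta period).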