在Spark作业执行期间,我经常遇到以下问题。即使是DataProc集群也是空闲的Spark驱动程序正在杀死所有执行程序,我最终得到了一个执行程序。
18/02/26 08:47:05 INFO spark.ExecutorAllocationManager: Existing executor 35 has been removed (new total is 2)
18/02/26 08:50:40 INFO scheduler.TaskSetManager: Finished task 189.0 in stage 5.0 (TID 6002) in 569184 ms on dse-dev-dataproc-w-53.c.k-ddh-lle.internal (executor 57) (499/500)
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 57
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Removing executor 57 because it has been idle for 60 seconds (new desired total will be 1)
18/02/26 08:51:42 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 57.
18/02/26 08:51:42 INFO scheduler.DAGScheduler: Executor lost: 57 (epoch 7)
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 57 from BlockManagerMaster.
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(57, dse-dev-dataproc-w-53.c.k-ddh-lle.internal, 53072, None)
18/02/26 08:51:42 INFO storage.BlockManagerMaster: Removed 57 successfully in removeExecutor
18/02/26 08:51:42 INFO cluster.YarnScheduler: Executor 57 on dse-dev-dataproc-w-53.c.k-ddh-lle.internal killed by driver.
18/02/26 08:51:42 INFO spark.ExecutorAllocationManager: Existing executor 57 has been removed (new total is 1)
完成上述任务后,当它启动下一个阶段时,它会查找随机数据。
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.190:42676
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.157:45812
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.177:53612
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.166:41901
此窗口将暂停一段时间,任务失败,但出现以下异常。
18/02/26 09:12:33 INFO BlockManagerMaster: Removal of executor 21 requested
18/02/26 09:12:33 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
18/02/26 00:12:33 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
18/02/26 09:12:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
我的作业执行者设置:
spark.driver.maxResultSize 3840m
spark.driver.memory 16G
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors 100
spark.dynamicAllocation.minExecutors 1
spark.executor.cores 6
spark.executor.id driver
spark.executor.instances 20
spark.executor.memory 30G
spark.hadoop.yarn.timeline-service.enabled False
spark.shuffle.service.enabled true
spark.sql.catalogImplementation hive
spark.sql.parquet.cacheMetadata false