Spark ExecutorLostFailure

Date: 2015-11-11 16:26:33

Tags: apache-spark

I am trying to run Spark 1.5 on Mesos in cluster mode. I am able to launch the dispatcher and run spark-submit, but when I do, the Spark driver fails with the following:

I1111 16:21:33.515130 25325 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/2bbe0c3b-433b-45e0-938b-f4d4532df129-S29","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/home\/optimus.prime\/Test.jar"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/2bbe0c3b-433b-45e0-938b-f4d4532df129-S29\/frameworks\/2bbe0c3b-433b-45e0-938b-f4d4532df129-0114\/executors\/driver-20151111162132-0036\/runs\/f0e8f4d7-35cb-4b73-bb5f-1112de2d8156"}
I1111 16:21:33.516376 25325 fetcher.cpp:369] Fetching URI '/home/optimus.prime/Test.jar'
I1111 16:21:33.516388 25325 fetcher.cpp:243] Fetching directly into the sandbox directory
I1111 16:21:33.516407 25325 fetcher.cpp:180] Fetching URI '/home/optimus.prime/Test.jar'
I1111 16:21:33.516417 25325 fetcher.cpp:160] Copying resource with command:cp '/home/optimus.prime/Test.jar' '/tmp/mesos/slaves/2bbe0c3b-433b-45e0-938b-f4d4532df129-S29/frameworks/2bbe0c3b-433b-45e0-938b-f4d4532df129-0114/executors/driver-20151111162132-0036/runs/f0e8f4d7-35cb-4b73-bb5f-1112de2d8156/Test.jar'
W1111 16:21:33.619190 25325 fetcher.cpp:265] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: /home/optimus.prime/Test.jar
I1111 16:21:33.619221 25325 fetcher.cpp:446] Fetched '/home/optimus.prime/Test.jar' to '/tmp/mesos/slaves/2bbe0c3b-433b-45e0-938b-f4d4532df129-S29/frameworks/2bbe0c3b-433b-45e0-938b-f4d4532df129-0114/executors/driver-20151111162132-0036/runs/f0e8f4d7-35cb-4b73-bb5f-1112de2d8156/Test.jar'
I1111 16:21:33.769359 25335 exec.cpp:134] Version: 0.25.0
I1111 16:21:33.774183 25341 exec.cpp:208] Executor registered on slave 2bbe0c3b-433b-45e0-938b-f4d4532df129-S29
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
15/11/11 16:21:34 INFO SparkContext: Running Spark version 1.5.1
15/11/11 16:21:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/11 16:21:35 INFO SecurityManager: Changing view acls to: root
15/11/11 16:21:35 INFO SecurityManager: Changing modify acls to: root
15/11/11 16:21:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/11/11 16:21:36 INFO Slf4jLogger: Slf4jLogger started
15/11/11 16:21:36 INFO Remoting: Starting remoting
15/11/11 16:21:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.241.10.12:36818]
15/11/11 16:21:36 INFO Utils: Successfully started service 'sparkDriver' on port 36818.
15/11/11 16:21:36 INFO SparkEnv: Registering MapOutputTracker
15/11/11 16:21:36 INFO SparkEnv: Registering BlockManagerMaster
15/11/11 16:21:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-2e733585-81ae-45ad-b81d-f2b977e38153
15/11/11 16:21:37 INFO MemoryStore: MemoryStore started with capacity 1069.1 MB
15/11/11 16:21:37 INFO HttpFileServer: HTTP File server directory is /tmp/spark-bbd7944b-7ffc-4911-a51b-5bed4e174fad/httpd-f94199aa-972d-4724-ad9e-f237401c6bab
15/11/11 16:21:37 INFO HttpServer: Starting HTTP Server
15/11/11 16:21:37 INFO Utils: Successfully started service 'HTTP file server' on port 53947.
15/11/11 16:21:37 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/11 16:21:37 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/11 16:21:37 INFO SparkUI: Started SparkUI at http://10.241.10.12:4040
15/11/11 16:21:37 INFO SparkContext: Added JAR file:/mnt/mesos/sandbox/Test.jar at http://10.241.10.12:53947/jars/Test.jar with timestamp 1447258897676
15/11/11 16:21:37 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
I1111 16:21:37.906981    96 sched.cpp:164] Version: 0.25.0
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@716: Client environment:host.name=mesos-slaves-spark-bjrg
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0-33-generic
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@725: Client environment:os.version=#38~14.04.1-Ubuntu SMP Fri Nov 6 18:17:28 UTC 2015
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
2015-11-11 16:21:37,907:9(0x7f67d2d3c700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-11-11 16:21:37,908:9(0x7f67d2d3c700):ZOO_INFO@log_env@753: Client environment:user.dir=/opt/spark
2015-11-11 16:21:37,908:9(0x7f67d2d3c700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.241.10.3:2181,10.241.10.4:2181,110.241.10.5:2181 sessionTimeout=10000 watcher=0x7f67dc7e3600 sessionId=0 sessionPasswd=<null> context=0x7f67ec021650 flags=0
2015-11-11 16:21:37,915:9(0x7f67d1438700):ZOO_INFO@check_events@1703: initiated connection to server [10.241.10.3:2181]
2015-11-11 16:21:37,917:9(0x7f67d1438700):ZOO_INFO@check_events@1750: session establishment complete on server [10.241.10.3:2181], sessionId=0x150a0c4f8a720bd, negotiated timeout=10000
I1111 16:21:37.917933    91 group.cpp:331] Group process (group(1)@10.241.10.12:59519) connected to ZooKeeper
I1111 16:21:37.918011    91 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1111 16:21:37.918088    91 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1111 16:21:37.919067    91 detector.cpp:156] Detected a new leader: (id='11')
I1111 16:21:37.919288    91 group.cpp:674] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
I1111 16:21:37.919922    91 detector.cpp:481] A new leading master (UPID=master@10.241.10.4:5050) is detected
I1111 16:21:37.920075    91 sched.cpp:262] New master detected at master@10.241.10.4:5050
I1111 16:21:37.920300    91 sched.cpp:272] No credentials provided. Attempting to register without authentication
I1111 16:21:37.926208    88 sched.cpp:641] Framework registered with 2bbe0c3b-433b-45e0-938b-f4d4532df129-0163
15/11/11 16:21:37 INFO MesosSchedulerBackend: Registered as framework ID 2bbe0c3b-433b-45e0-938b-f4d4532df129-0163
15/11/11 16:21:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57551.
15/11/11 16:21:38 INFO NettyBlockTransferService: Server created on 57551
15/11/11 16:21:38 INFO BlockManagerMaster: Trying to register BlockManager
15/11/11 16:21:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.241.10.12:57551 with 1069.1 MB RAM, BlockManagerId(driver, 10.241.10.12, 57551)
15/11/11 16:21:38 INFO BlockManagerMaster: Registered BlockManager
15/11/11 16:21:39 INFO SparkContext: Starting job: sumApprox at Test.scala:21
15/11/11 16:21:39 INFO DAGScheduler: Got job 0 (sumApprox at Test.scala:21) with 8 output partitions
15/11/11 16:21:39 INFO DAGScheduler: Final stage: ResultStage 0(sumApprox at Test.scala:21)
15/11/11 16:21:39 INFO DAGScheduler: Parents of final stage: List()
15/11/11 16:21:39 INFO DAGScheduler: Missing parents: List()
15/11/11 16:21:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at numericRDDToDoubleRDDFunctions at Test.scala:21), which has no missing parents
15/11/11 16:21:39 INFO MemoryStore: ensureFreeSpace(1760) called with curMem=0, maxMem=1120995901
15/11/11 16:21:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1760.0 B, free 1069.1 MB)
15/11/11 16:21:39 INFO MemoryStore: ensureFreeSpace(1151) called with curMem=1760, maxMem=1120995901
15/11/11 16:21:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1151.0 B, free 1069.1 MB)
15/11/11 16:21:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.241.10.12:57551 (size: 1151.0 B, free: 1069.1 MB)
15/11/11 16:21:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
15/11/11 16:21:39 INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at numericRDDToDoubleRDDFunctions at Test.scala:21)
15/11/11 16:21:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
15/11/11 16:21:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.241.10.15, PROCESS_LOCAL, 2053 bytes)
15/11/11 16:21:39 INFO TaskSetManager: Re-queueing tasks for 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from TaskSet 0.0
15/11/11 16:21:39 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.241.10.15): ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost)
15/11/11 16:21:39 INFO DAGScheduler: Executor lost: 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 (epoch 0)
15/11/11 16:21:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from BlockManagerMaster.
15/11/11 16:21:39 INFO BlockManagerMaster: Removed 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 successfully in removeExecutor
15/11/11 16:21:39 INFO DAGScheduler: Host added was in lost list earlier: 10.241.10.15
15/11/11 16:21:39 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 10.241.10.15, PROCESS_LOCAL, 2053 bytes)
15/11/11 16:21:40 INFO TaskSetManager: Re-queueing tasks for 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from TaskSet 0.0
15/11/11 16:21:40 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, 10.241.10.15): ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost)
15/11/11 16:21:40 INFO DAGScheduler: Executor lost: 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 (epoch 1)
15/11/11 16:21:40 INFO BlockManagerMasterEndpoint: Trying to remove executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from BlockManagerMaster.
15/11/11 16:21:40 INFO BlockManagerMaster: Removed 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 successfully in removeExecutor
15/11/11 16:21:40 INFO DAGScheduler: Host added was in lost list earlier: 10.241.10.15
15/11/11 16:21:40 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 10.241.10.15, PROCESS_LOCAL, 2053 bytes)
15/11/11 16:21:40 INFO TaskSetManager: Re-queueing tasks for 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from TaskSet 0.0
15/11/11 16:21:40 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2, 10.241.10.15): ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost)
15/11/11 16:21:40 INFO DAGScheduler: Executor lost: 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 (epoch 2)
15/11/11 16:21:40 INFO BlockManagerMasterEndpoint: Trying to remove executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from BlockManagerMaster.
15/11/11 16:21:40 INFO BlockManagerMaster: Removed 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 successfully in removeExecutor
15/11/11 16:21:40 INFO DAGScheduler: Host added was in lost list earlier: 10.241.10.15
15/11/11 16:21:40 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 10.241.10.15, PROCESS_LOCAL, 2053 bytes)
15/11/11 16:21:40 INFO TaskSetManager: Re-queueing tasks for 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from TaskSet 0.0
15/11/11 16:21:40 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, 10.241.10.15): ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost)
15/11/11 16:21:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/11/11 16:21:40 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/11/11 16:21:40 INFO TaskSchedulerImpl: Cancelling stage 0
15/11/11 16:21:40 INFO DAGScheduler: ResultStage 0 (sumApprox at Test.scala:21) failed in 0.713 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.241.10.15): ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/11/11 16:21:40 INFO DAGScheduler: Executor lost: 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 (epoch 3)
15/11/11 16:21:40 INFO SparkContext: Invoking stop() from shutdown hook
15/11/11 16:21:40 INFO BlockManagerMasterEndpoint: Trying to remove executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 from BlockManagerMaster.
15/11/11 16:21:40 INFO BlockManagerMaster: Removed 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 successfully in removeExecutor
15/11/11 16:21:40 INFO DAGScheduler: Host added was in lost list earlier: 10.241.10.15
15/11/11 16:21:40 INFO SparkUI: Stopped Spark web UI at http://10.241.10.12:4040
15/11/11 16:21:40 INFO DAGScheduler: Stopping DAGScheduler
I1111 16:21:40.447157   108 sched.cpp:1771] Asked to stop the driver
I1111 16:21:40.447325    87 sched.cpp:1040] Stopping framework '2bbe0c3b-433b-45e0-938b-f4d4532df129-0163'
15/11/11 16:21:40 INFO MesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
15/11/11 16:21:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/11 16:21:40 INFO MemoryStore: MemoryStore cleared
15/11/11 16:21:40 INFO BlockManager: BlockManager stopped
15/11/11 16:21:40 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/11 16:21:40 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/11 16:21:40 INFO SparkContext: Successfully stopped SparkContext
15/11/11 16:21:40 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/11/11 16:21:40 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/11/11 16:21:40 INFO ShutdownHookManager: Shutdown hook called
15/11/11 16:21:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-bbd7944b-7ffc-4911-a51b-5bed4e174fad

Also, since I am using Docker, I searched the logs of the slave that was supposed to execute the task, and all I found was:

root@bfa1a77de2af:/opt/spark# exit
exit

Any ideas about the error?

Thanks

6 answers:

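Answer 0 (score: 7)

I had a similar problem and used some trial and error to find the cause and a solution. I may not be able to give the "real" reason, but trying the following approach can help you resolve it.

Try launching spark-shell with explicit memory and core parameters: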

# Notes on the settings below:
#   spark.storage.memoryFraction=1            important
#   spark.akka.frameSize=200                  keep it sufficiently high; higher than 100 is probably a good thing
#   spark.yarn.executor.memoryOverhead=2048   in MB; not really valid for the shell, but a good thing for spark-submit
#   spark.yarn.driver.memoryOverhead=400      in MB, minimum 384; not really valid for the shell, but a good thing for spark-submit
spark-shell \
  --driver-memory=2g \
  --executor-memory=7g \
  --num-executors=8 \
  --executor-cores=4 \
  --conf "spark.storage.memoryFraction=1" \
  --conf "spark.akka.frameSize=200" \
  --conf "spark.default.parallelism=100" \
  --conf "spark.core.connection.ack.wait.timeout=600" \
  --conf "spark.yarn.executor.memoryOverhead=2048" \
  --conf "spark.yarn.driver.memoryOverhead=400"

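Now, if the total memory (driver memory + number of executors * executor memory) exceeds the available memory, it will throw an error. I believe that is not the case for you.

Keep the executor cores small, say 2 or 4.

Executor memory = (total memory - driver memory) / number of executors ... actually a bit less.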
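The sizing rule above can be worked through with a little shell arithmetic. This is only a sketch: the function names `executor_memory_mb` and `total_requested_mb`, the 10% headroom standing in for "a bit less", and the 64 GB budget are assumptions for illustration, not values from the question.

```shell
# Per-executor memory in MB: (total - driver) / executors, minus headroom.
# The 10% headroom is an assumed stand-in for "actually a bit less".
executor_memory_mb() {
  local total_mb=$1 driver_mb=$2 num_executors=$3
  local raw=$(( (total_mb - driver_mb) / num_executors ))
  echo $(( raw * 90 / 100 ))
}

# Total requested memory in MB: driver + executors * executor memory.
total_requested_mb() {
  local driver_mb=$1 num_executors=$2 executor_mb=$3
  echo $(( driver_mb + num_executors * executor_mb ))
}

# Hypothetical budget: 64 GB in total, 2 GB driver, 8 executors.
executor_memory_mb 65536 2048 8
# The spark-shell settings shown earlier: 2g driver, 8 executors of 7g each.
total_requested_mb 2048 8 7168
```

Comparing the second figure against what the cluster can actually offer is a quick way to see whether the requested settings could fit at all before launching.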

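  • Try increasing the number of executors while reducing the executor memory, so that total memory stays in check.
  • Once spark-shell starts, go to the job monitoring UI and check the 'Executors' tab. You may see that even though you asked for 20 executors, only 10 were created. This indicates how far you can go.
  • Reduce the number of executors to a suitable number below that maximum and change the 'executor memory' parameter accordingly.
  • Once the number of executors you get matches the number you passed to spark-shell, you are "almost good".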

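The next step is to run your code at the spark-shell prompt and check how much memory is used in the Executors tab.

  • If you see the last few 'collect' steps taking a lot of time, the executor memory needs to be increased.
  • If increasing the executor memory goes beyond the limit calculated earlier, reduce the number of executors and give each one more memory.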

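What I understood (empirically, though) is that the following kinds of problems can occur:

  • Long-running reduce/shuffle operations, with execution timing out
  • Long-running threads creating unresponsive actors
  • Not enough akka frames to monitor too many threads (tasks)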

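I hope this helps you arrive at the right configuration. Once it is set, you can use the same configuration when submitting a job with spark-submit.

Note: I was given a cluster with a lot of resource constraints, and multiple users used it in an ad hoc way. That made the available resources uncertain, so the calculations had to stay within 'safer' limits. This led to a lot of iterative experiments.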

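Answer 1 (score: 2)

Almost every time I have run into ExecutorLost failures in Spark, adding more memory solved the problem. Try increasing the values of the --executor-memory and/or --driver-memory options that you can pass to spark-submit.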
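As a rough illustration of that advice, the sketch below prints (rather than runs) a spark-submit command with larger memory settings so it can be reviewed first. The dispatcher address `dispatcher.example:7077`, the main class name `Test`, and the 4g/2g values are hypothetical placeholders; only the jar path comes from the question's log.

```shell
# Print a spark-submit command line with the given memory settings.
# The dispatcher URL and the class name "Test" are hypothetical placeholders.
submit_cmd() {
  local executor_mem=$1 driver_mem=$2
  echo spark-submit \
    --master mesos://dispatcher.example:7077 \
    --deploy-mode cluster \
    --class Test \
    --executor-memory "$executor_mem" \
    --driver-memory "$driver_mem" \
    /home/optimus.prime/Test.jar
}

# Dry run: inspect the command, then paste it into a shell to execute it.
submit_cmd 4g 2g
```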

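Answer 2 (score: 1)

Your executor may be lost for many different reasons, but the information you obtained (and showed) is not enough to understand why.

Even though I have no experience with Mesos in cluster mode, it seems to me that what you show as the executor log is incomplete. If you can get the full logs, you will find them helpful in determining the cause of such a failure. I had a look at: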

http://mesos.apache.org/documentation/latest/configuration/

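You should be able to get the logs you are looking for from stderr (perhaps you are only showing stdout?). You can also try the --log_dir=VALUE parameter to dump the logs and get a better view of the situation.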
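One way to locate those stderr files is sketched below. It assumes the agent work directory is /tmp/mesos and follows the sandbox layout visible in the fetcher lines of the question's log (slaves/<slave-id>/frameworks/<framework-id>/executors/<executor-id>/runs/<run-id>/). The function name `sandbox_stderr_files` is made up for this example. With Docker, the same sandbox appears inside the container under /mnt/mesos/sandbox, as the "Added JAR" log line shows.

```shell
# List stderr files for one framework under a Mesos agent work directory.
sandbox_stderr_files() {
  local work_dir=$1 framework_id=$2
  find "$work_dir/slaves" -type f -name stderr \
    -path "*/frameworks/$framework_id/executors/*" 2>/dev/null
}

# Run on the host that lost the executor (10.241.10.15 in the log),
# using the framework ID the driver registered with.
sandbox_stderr_files /tmp/mesos 2bbe0c3b-433b-45e0-938b-f4d4532df129-0163 |
  while read -r f; do
    echo "== $f"
    tail -n 50 "$f"
  done
```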

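Answer 3 (score: 0)

Check for long GC times in the event log or the UI. If you have a persist, removing it can free up more memory for your executors (at the cost of running a stage more than once). If you are using a broadcast, see whether you can reduce its footprint. Or just add more memory.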

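Answer 4 (score: 0)

ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost) is reported when a task fails because the executor it was running on was lost. This can happen because the task crashed the JVM.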
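To see whether the same executor keeps getting lost, as it does in the question's log, the messages can be tallied per executor ID. This is a small sketch; the function name `lost_executor_counts` and the file name `driver.log` are assumptions for illustration.

```shell
# Count "ExecutorLostFailure" messages per executor ID in a driver log.
lost_executor_counts() {
  grep -o 'ExecutorLostFailure (executor [^ ]* lost)' "$1" |
    awk '{ count[$3]++ } END { for (id in count) print count[id], id }'
}

# Usage, assuming the driver output was saved to driver.log:
#   lost_executor_counts driver.log
```

A single executor ID accounting for every failure points at a problem on that one host rather than in the job as a whole.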

Answer 5 (score: 0)

Setting the parallelism helps. Try increasing the parallelism in your cluster with the following parameter:

--conf "spark.default.parallelism=100"