Spark program hangs at Job finished: toArray - workers throw java.util.concurrent.TimeoutException

Time: 2014-06-06 11:35:11

Tags: scala apache-spark

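So I have a simple Spark job in which I'm trying to work out how to write bytes to a sequence file. It was working fine, then all of a sudden the job seems to hang at the end, specifically at this line: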

14/06/06 10:57:48 INFO SparkContext: Job finished: toArray at XXXX.scala:104, took 44.439736728 s

So I looked at the workers' stderr logs and saw this:

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:162)
    at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
    at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
    at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
    at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

The job output contains some strange INFO messages that I have never seen before:

14/06/06 11:08:28 INFO TaskSetManager: Finished TID 2 in 2163 ms on ip-172-31-23-17.ec2.internal (progress: 0/5)
14/06/06 11:08:28 INFO DAGScheduler: Completed ResultTask(1, 0)
14/06/06 11:08:30 INFO TaskSetManager: Finished TID 3 in 3635 ms on ip-172-31-29-86.ec2.internal (progress: 1/5)
14/06/06 11:08:30 INFO DAGScheduler: Completed ResultTask(1, 1)

^^ Normal output; I see this in jobs all the time. But below are lots of weird messages.

14/06/06 11:08:50 INFO BlockManagerMasterActor$BlockManagerInfo: Added taskresult_6 in memory on ip-172-31-30-95.ec2.internal:41661 (size: 253.9 MB, free: 2.6 GB)
14/06/06 11:08:50 INFO SendingConnection: Initiating connection to [ip-172-31-30-95.ec2.internal/172.31.30.95:41661]
14/06/06 11:08:50 INFO SendingConnection: Connected to [ip-172-31-30-95.ec2.internal/172.31.30.95:41661], 1 messages pending
14/06/06 11:08:50 INFO ConnectionManager: Accepted connection from [ip-172-31-30-95.ec2.internal/172.31.30.95]
14/06/06 11:08:52 INFO TaskSetManager: Finished TID 6 in 25831 ms on ip-172-31-30-95.ec2.internal (progress: 2/5)
14/06/06 11:08:52 INFO BlockManagerMasterActor$BlockManagerInfo: Removed taskresult_6 on ip-172-31-30-95.ec2.internal:41661 in memory (size: 253.9 MB, free: 2.9 GB)
14/06/06 11:08:53 INFO DAGScheduler: Completed ResultTask(1, 4)
14/06/06 11:08:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added taskresult_4 in memory on ip-172-31-22-58.ec2.internal:46736 (size: 329.3 MB, free: 2.6 GB)
14/06/06 11:08:57 INFO SendingConnection: Initiating connection to [ip-172-31-22-58.ec2.internal/172.31.22.58:46736]
14/06/06 11:08:57 INFO SendingConnection: Connected to [ip-172-31-22-58.ec2.internal/172.31.22.58:46736], 1 messages pending
14/06/06 11:08:57 INFO ConnectionManager: Accepted connection from [ip-172-31-22-58.ec2.internal/172.31.22.58]
14/06/06 11:09:00 INFO TaskSetManager: Finished TID 4 in 33738 ms on ip-172-31-22-58.ec2.internal (progress: 3/5)
14/06/06 11:09:00 INFO BlockManagerMasterActor$BlockManagerInfo: Removed taskresult_4 on ip-172-31-22-58.ec2.internal:46736 in memory (size: 329.3 MB, free: 2.9 GB)
14/06/06 11:09:02 INFO DAGScheduler: Completed ResultTask(1, 2)

If I am very patient, the job eventually spits out some weird stuff:

14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-30-95.ec2.internal,41661)
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/9 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-30-95.ec2.internal,41661)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@6b071630
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-22-58.ec2.internal,46736)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-22-58.ec2.internal,46736)
14/06/06 11:14:15 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@6b071630
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
    at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/9 removed: Command exited with code 50
14/06/06 11:14:15 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
    at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, ip-172-31-30-95.ec2.internal, 41661, 0) with no recent heart beats: 132381ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 9 disconnected, so removing it
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(6, ip-172-31-17-30.ec2.internal, 43082, 0) with no recent heart beats: 132382ms exceeds 45000ms
14/06/06 11:14:15 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(<driver>, ip-172-31-23-17.ec2.internal, 55101, 0) with no recent heart beats: 132385ms exceeds 45000ms
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 9 (already removed): Uncaught exception
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3c39a92
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(8, ip-172-31-22-58.ec2.internal, 46736, 0) with no recent heart beats: 132377ms exceeds 45000ms
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/10 on worker-20140606110717-ip-172-31-21-172.ec2.internal-7078 (ip-172-31-21-172.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@3c39a92
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:267)
    at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(9, ip-172-31-21-172.ec2.internal, 42635, 0) with no recent heart beats: 132384ms exceeds 45000ms
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@46000f2b
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/10 on hostPort ip-172-31-21-172.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-28-236.ec2.internal, 35129, 0) with no recent heart beats: 132379ms exceeds 45000ms
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/10 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/4 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/4 removed: Command exited with code 50
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@46000f2b
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:267)
    at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 4 on ip-172-31-28-73.ec2.internal: Uncaught exception
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/11 on worker-20140606110708-ip-172-31-28-73.ec2.internal-7078 (ip-172-31-28-73.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 4 (epoch 0)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/11 on hostPort ip-172-31-28-73.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/3 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/3 removed: Command exited with code 50
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 1 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 1 on ip-172-31-30-95.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 3 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 3 (already removed): Uncaught exception
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 7 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 7 on ip-172-31-28-236.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(2, ip-172-31-23-17.ec2.internal, 44685, 0) with no recent heart beats: 132373ms exceeds 45000ms
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-172-31-24-194.ec2.internal, 47896, 0) with no recent heart beats: 132382ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 5 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 5 on ip-172-31-29-86.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/12 on worker-20140606110708-ip-172-31-26-188.ec2.internal-7078 (ip-172-31-26-188.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-172-31-29-86.ec2.internal, 48078, 0) with no recent heart beats: 132380ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 8 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 8 on ip-172-31-22-58.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/12 on hostPort ip-172-31-26-188.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/6 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 2 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/6 removed: Command exited with code 50
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 1 (epoch 1)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 2 on ip-172-31-23-17.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 7 (epoch 2)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster.
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 0 (already removed): remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 6 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 6 (already removed): remote Akka client disassociated
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 5 (epoch 3)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster.
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/13 on worker-20140606110717-ip-172-31-17-30.ec2.internal-7078 (ip-172-31-17-30.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/13 on hostPort ip-172-31-17-30.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 8 (epoch 4)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 8 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 8 successfully in removeExecutor
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/11 is now RUNNING
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 2 (epoch 5)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 2 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/13 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/12 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/0 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/0 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/14 on worker-20140606110706-ip-172-31-24-194.ec2.internal-7078 (ip-172-31-24-194.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/14 on hostPort ip-172-31-24-194.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/14 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/5 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/5 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/15 on worker-20140606110706-ip-172-31-29-86.ec2.internal-7078 (ip-172-31-29-86.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/15 on hostPort ip-172-31-29-86.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/15 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/1 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/1 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/16 on worker-20140606110708-ip-172-31-30-95.ec2.internal-7078 (ip-172-31-30-95.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/16 on hostPort ip-172-31-30-95.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/16 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/8 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/8 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/17 on worker-20140606110708-ip-172-31-22-58.ec2.internal-7078 (ip-172-31-22-58.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/17 on hostPort ip-172-31-22-58.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/17 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/7 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/7 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/18 on worker-20140606110706-ip-172-31-28-236.ec2.internal-7078 (ip-172-31-28-236.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/18 on hostPort ip-172-31-28-236.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/18 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/2 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/2 removed: Command exited with code 50
14/06/06 11:14:15 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/06 11:14:15 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...

Then it just hangs again... If I am patient, it spits out the following and then hangs again:

14/06/06 11:14:15 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/06 11:16:54 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3, ip-172-31-26-188.ec2.internal, 55392, 0) with no recent heart beats: 159686ms exceeds 45000ms
14/06/06 11:19:42 WARN BlockManagerMaster: Error sending message to BlockManagerMaster in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:162)
    at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
    at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
    at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
    at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

After 10 minutes my patience ran out and I killed it with kill -9 (a normal interrupt did not work).

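The question is: how do I get my cluster back into a working state? It seems Spark is holding on to some state somewhere that we cannot find. We have tried deleting the Spark cache files, i.e. .../spark/spark-*, and we have tried restarting all the workers and the master!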

Update:

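I think the problem may be that the file I was reading had somehow become corrupted, in the sense that it had grown to about 370 MB. Calling toArray on that much data probably made things go crazy. After deleting that file and trying again with other files, things went back to normal. The question still stands, though, because the behaviour is not what one would expect: one would only expect a long wait and then perhaps an OOM.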
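For reference, here is a minimal sketch of writing bytes to a sequence file without pulling everything back to the driver with toArray. This is not the code from the job above (which is not shown here). The object name, the app name, the sample `records` RDD and the output path taken from `args(0)` are all made up for illustration, and it assumes the job is launched through spark-submit so that the master is supplied externally.

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}
// On Spark 1.x this import brings in the implicit conversion that adds
// saveAsSequenceFile to RDDs of key/value pairs.
import org.apache.spark.SparkContext._

// Hypothetical example object; the names here are not from the original job.
object WriteBytesToSequenceFile {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is provided by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("write-bytes-sketch"))

    // Stand-in data: an RDD[Array[Byte]] built from a few short strings.
    val records = sc.parallelize(Seq("a", "b", "c")).map(_.getBytes("UTF-8"))

    // Wrap each byte array in Hadoop Writables and let the executors write
    // the output, so no large result is shipped back to the driver.
    records
      .map(bytes => (NullWritable.get(), new BytesWritable(bytes)))
      .saveAsSequenceFile(args(0)) // assumption: args(0) is the output directory

    sc.stop()
  }
}
```

The point of the sketch is only that the write happens on the executors, so a single oversized input would show up as a slow or failing task rather than as a several-hundred-megabyte task result being sent to the driver.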

0 answers:

No answers