I have a long-running process that executes Spark jobs against a standalone Spark cluster with 2 worker nodes. Initially the jobs complete successfully, but after a day or two some of them start to fail. All of the failing jobs are the result of stage failures on the same worker, which logs the following sequence and exception to its stderr log:
16/05/04 21:07:53 INFO MemoryStore: ensureFreeSpace(2273) called with curMem=988397261, maxMem=1159641169
16/05/04 21:07:53 INFO MemoryStore: Block broadcast_259_piece0 stored as bytes in memory (estimated size 2.2 KB, free 163.3 MB)
16/05/04 21:07:53 INFO TorrentBroadcast: Reading broadcast variable 259 took 9 ms
16/05/04 21:07:53 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_259 in memory.
16/05/04 21:07:53 WARN MemoryStore: Not enough space to cache broadcast_259 in memory! (computed 504.0 B so far)
16/05/04 21:07:53 INFO MemoryStore: Memory use = 942.6 MB (blocks) + 162.9 MB (scratch space shared across 0 tasks(s)) = 1105.5 MB. Storage limit = 1105.9 MB.
16/05/04 21:07:53 WARN MemoryStore: Persisting block broadcast_259 to disk instead.
16/05/04 21:07:53 WARN BlockManager: Putting block broadcast_259 failed
16/05/04 21:07:53 INFO TorrentBroadcast: Started reading broadcast variable 259
16/05/04 21:07:53 ERROR Executor: Exception in task 1.0 in stage 220.0 (TID 2575)
java.io.FileNotFoundException: /tmp/aetmpdir/spark/tmp/spark-4de0f2b6-6d96-41f8-9d28-9e9d288f143a/executor-d2b171a9-377d-4857-9754-c41332ceda66/blockmgr-ed82e9b2-bc90-4069-bd89-ed0e7f57468c/28/broadcast_259 (A file or directory in the path name does not exist.)
at java.io.FileOutputStream.<init>(FileOutputStream.java:233)
at java.io.FileOutputStream.<init>(FileOutputStream.java:183)
at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:78)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:176)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:996)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:182)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1175)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:798)
Looking at the filesystem on that node, it is clear that the numbered subdirectory within the blockmgr directory (28 in this case) does not exist, which appears to be the root of the FileNotFoundException. However, other directories at the same level do exist.
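For context on why one missing numbered subdirectory is fatal: in Spark 1.x, DiskBlockManager derives the subdirectory deterministically from a hash of the block's file name, creates it on first use, and then caches the File handle, so it will not re-create a directory it believes already exists. A minimal Scala sketch of that mapping, based on my reading of the Spark 1.x source (the local dir path is illustrative):

import java.io.File

object BlockDirMapping {
  def main(args: Array[String]): Unit = {
    val localDirs = Array(new File("/tmp/spark-local")) // hypothetical spark.local.dir
    val subDirsPerLocalDir = 64                         // default; dirs are named 00 through 3f

    val filename = "broadcast_259"
    val hash = filename.hashCode & Int.MaxValue         // non-negative hash, as in Utils.nonNegativeHash
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

    // Subdirectory names are two hex digits, e.g. "28":
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    println(s"$filename -> ${new File(subDir, filename).getPath}")
  }
}

So if something removes an (empty) subdirectory after Spark has created and cached it, the next write to a block that hashes into it fails with exactly this FileNotFoundException.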
For comparison, on the healthy worker node I see the following sequence in its log:
16/05/02 22:46:15 INFO MemoryStore: ensureFreeSpace(22869550) called with curMem=948197297, maxMem=1159641169
16/05/02 22:46:15 INFO MemoryStore: 33 blocks selected for dropping
16/05/02 22:46:15 INFO BlockManager: Dropping block broadcast_1_piece0 from memory
16/05/02 22:46:15 INFO BlockManager: Writing block broadcast_1_piece0 to disk
16/05/02 22:46:15 INFO BlockManager: Dropping block broadcast_1 from memory
16/05/02 22:46:15 INFO BlockManager: Writing block broadcast_1 to disk
16/05/02 22:46:15 INFO BlockManager: Dropping block broadcast_0_piece0 from memory
16/05/02 22:46:15 INFO BlockManager: Writing block broadcast_0_piece0 to disk
Also on the healthy worker node, I see all of the directories from 00 through 3f present without any gaps.
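One way to automate that comparison is to list which of the 64 expected hex-named subdirectories are missing on each node. A small sketch (the blockmgr path is a placeholder; substitute the one from the executor's log):

import java.io.File

object CheckBlockMgrDirs {
  def main(args: Array[String]): Unit = {
    // Placeholder path -- use the blockmgr-* directory from the failing executor's log.
    val blockMgrDir = new File("/path/to/blockmgr-ed82e9b2-bc90-4069-bd89-ed0e7f57468c")
    val expected = (0 until 64).map("%02x".format(_)).toSet
    val present = Option(blockMgrDir.list()).map(_.toSet).getOrElse(Set.empty[String])
    val missing = (expected -- present).toSeq.sorted
    println(if (missing.isEmpty) "no gaps: 00-3f all present"
            else s"missing subdirectories: ${missing.mkString(", ")}")
  }
}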
Although the two nodes are configured similarly, it appears that something in the background is cleaning out the blockmgr directory on the failing node.
Any hints or insights would be greatly appreciated.
Answer 0 (score: 0)
After further investigation, this turned out to be caused by a system cleanup process that was deleting empty directories under the Spark tmp directory location.
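One mitigation worth noting (my addition, not part of the original answer) is to point spark.local.dir at a location the system tmp cleaner does not sweep, so the blockmgr subdirectories are never removed out from under the executors. A minimal sketch, assuming a hypothetical /var/spark/tmp directory that is excluded from cleanup:

import org.apache.spark.{SparkConf, SparkContext}

object LocalDirExample {
  def main(args: Array[String]): Unit = {
    // spark.local.dir controls where Spark places its scratch space,
    // including the blockmgr directories; /var/spark/tmp is hypothetical.
    val conf = new SparkConf()
      .setAppName("local-dir-example")
      .setMaster("local[2]") // for a self-contained run; on a cluster the master comes from spark-submit
      .set("spark.local.dir", "/var/spark/tmp")

    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).sum()) // trivial job to exercise the setting
    } finally {
      sc.stop()
    }
  }
}

Note that on a standalone cluster the workers may also need SPARK_LOCAL_DIRS set in spark-env.sh, since settings made by the cluster manager take precedence over the application's spark.local.dir. Alternatively, excluding the Spark tmp location from the cleaner's rules achieves the same effect.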