Question

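I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G RAM in total), and the largest RDD has only 76k+ rows. However, the data is heavily skewed in the middle of the job (so it requires repartitioning), and each row holds around 100k of data after serialization. The job always gets stuck in repartitioning. Namely, it constantly hits the following errors and retries: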

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer

org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...

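I've tried to identify the problem, but both memory and disk consumption on the machines throwing these errors appear to stay below 50%. I've also tried different configurations, including: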

Let driver/executor memory use 60% of total memory.
Let Netty prioritize the JVM shuffling buffer.
Increase the shuffling streaming buffer to 128m.
Use KryoSerializer and max out all buffers.
Increase the shuffling memoryFraction to 0.4.

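But none of them works. The small job always triggers the same series of errors and maxes out its retries (up to 1000 times). How do I troubleshoot this in such a situation?

Thanks a lot if you have any clue.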

Answer 1

Check your logs to see whether you get an error similar to this one:

ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated

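Every time you get this error, it is because you lost an executor. Why you lost the executor is another story; again, check your logs for clues.

One possibility is that YARN kills your job if it thinks you are using "too much memory".

Check for something like this: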

org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl  - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
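To make the "check your logs" advice concrete, here is a minimal sketch that scans log lines for the messages quoted in this thread. The function name `scan_log` and the pattern labels are made up for illustration, and the regular expressions only cover the exact wording shown here; other Spark or YARN versions may phrase these messages differently.

```python
import re

# Patterns copied from the log lines quoted in this thread.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "lost_executor": re.compile(r"Lost executor (\S+) on (\S+)"),
    "yarn_memory_kill": re.compile(r"is running beyond physical memory limits"),
    "container_killed": re.compile(r"Container killed on request\. Exit code is (\d+)"),
}


def scan_log(lines):
    """Return (line_number, label, text) for every line matching a known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

It can be fed any iterable of strings, for example an open file handle for an executor or NodeManager log, and the hits can then be printed or counted per label.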

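Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html

The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.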
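The "increase until it stops failing" loop can be sketched as follows. This is only an illustration: the helper names are hypothetical, the doubling step is an arbitrary choice, and it assumes the setting is given in megabytes and that YARN sizes each executor container as heap plus overhead. The cap `max_container_mb` stands in for whatever per-container limit your cluster enforces.

```python
def container_mb(executor_memory_mb, overhead_mb):
    """Memory requested from YARN per executor: heap plus off-heap overhead."""
    return executor_memory_mb + overhead_mb


def overhead_conf(overhead_mb):
    """Format the setting as a spark-submit --conf argument pair."""
    return ["--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}"]


def next_overhead(current_mb, executor_memory_mb, max_container_mb):
    """Double the overhead, capped so heap + overhead still fits the container limit."""
    ceiling = max_container_mb - executor_memory_mb
    if ceiling <= current_mb:
        raise ValueError("no headroom left; lower executor memory or raise the YARN limit")
    return min(current_mb * 2, ceiling)
```

For instance, a 16384 MB heap with 2048 MB of overhead asks for 18432 MB per container. If the job still dies with the "running beyond physical memory limits" message, `next_overhead` gives the next value to try.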

Answer 2

I was also getting the error

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

and, looking further in the log, I found

Container killed on request. Exit code is 143

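After searching for the exit code, I realized that it is mainly related to memory allocation. So I checked the amount of memory I had configured for the executors, and found that I had mistakenly configured 7g for the driver and only 1g for the executor. After increasing the executor memory, my Spark job ran successfully.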
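A small sketch of how that mix-up could be caught before submitting. `--driver-memory` and `--executor-memory` are standard spark-submit options; the helper names and the "driver larger than executor" check are assumptions made for illustration. A large driver is sometimes intentional (for example when collecting big results), so treat the check as a prompt to double-check rather than a rule.

```python
def parse_mb(size):
    """Convert a JVM-style size such as '7g' or '512m' to megabytes."""
    units = {"m": 1, "g": 1024}
    value, unit = size[:-1], size[-1].lower()
    if unit not in units:
        raise ValueError(f"expected a size ending in m or g, got {size!r}")
    return int(value) * units[unit]


def submit_args(driver_memory, executor_memory, app):
    """Build a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        app,
    ]


def looks_swapped(driver_memory, executor_memory):
    """Heuristic: flag configs where the driver gets more memory than each executor."""
    return parse_mb(driver_memory) > parse_mb(executor_memory)
```

With the values from this answer, `looks_swapped("7g", "1g")` flags the original configuration, while the corrected one can be passed to `submit_args` (the application file name is whatever your job uses).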

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?
