Question

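I am new to Spark. I have a Spark job that runs on an Amazon EMR cluster with 1 master and 8 core nodes. In a nutshell, the job reads some .csv files from S3, converts them to RDDs, performs some relatively complex joins on the RDDs, and finally writes other .csv files back to S3. Run on the EMR cluster, this job used to take about 5 hours. Then one day it suddenly started taking more than 30 hours, and it has done so ever since. There is no apparent difference in the inputs (the S3 files).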

I have checked the logs, and in the long (30-hour) runs I can see OutOfMemory errors:

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:472)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:441)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
....

        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)

Despite the OutOfMemory exceptions, the output (the S3 files) looks fine, so the Spark job apparently completes correctly.

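What could suddenly cause the execution time to jump from 5 hours to 30 hours? How would you go about investigating a problem like this?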

Answer 1

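Spark retries on failure. Your processes are failing. When that happens, all of the tasks that were active at the time are probably treated as failed and requeued elsewhere in the cluster. Re-running that work is likely where the extra time goes, and it would also explain why the job still finishes with correct output.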
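One way to check whether retries are what is eating the time is to count failed task attempts in the Spark event log, if event logging is enabled for the job. The following is only a rough sketch, not a definitive tool. It assumes the event log is JSON lines containing `SparkListenerTaskEnd` events with `Stage ID` and `Task End Reason` fields; verify those names against your Spark version. The function name `count_task_failures` and the sample data are made up for illustration.

```python
import json
from collections import Counter


def count_task_failures(event_lines):
    """Count non-successful task ends per stage and per failure reason.

    Assumes each line is one JSON event, as in a Spark event log
    (field names are an assumption; check them for your Spark version).
    """
    failed_by_stage = Counter()
    reasons = Counter()
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        reason = event.get("Task End Reason", {}).get("Reason", "Unknown")
        if reason != "Success":
            failed_by_stage[event.get("Stage ID")] += 1
            reasons[reason] += 1
    return failed_by_stage, reasons


# Hypothetical sample events, standing in for lines read from a real log file.
sample = [
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 3,
                "Task End Reason": {"Reason": "Success"}}),
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 3,
                "Task End Reason": {"Reason": "ExceptionFailure"}}),
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 3,
                "Task End Reason": {"Reason": "ExecutorLostFailure"}}),
    json.dumps({"Event": "SparkListenerJobStart", "Job ID": 0}),
]

failed_by_stage, reasons = count_task_failures(sample)
print(dict(failed_by_stage))
print(dict(reasons))
```

Comparing these counts between a 5-hour run and a 30-hour run should show whether the slow runs have many more failed attempts, and in which stages. If the failures cluster in the stages that read the broadcast value seen in your stack trace, that is the place to look at memory use.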

Spark job on EMR suddenly takes 30 hours (up from 5)
