I'm loading a 20 GB file in Spark Standalone mode on a machine with 4 GB RAM and 2 cores, doing some processing, and then trying to save the result (for testing purposes) to a text file with saveAsTextFile.
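The job is essentially of this shape (a minimal sketch: the master URL, the input/output paths, and the map step are placeholders standing in for my actual processing):

    # Minimal PySpark sketch of the pipeline; paths and master URL are hypothetical.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master-host:7077")  # hypothetical Standalone master
            .setAppName("twenty-gb-test"))
    sc = SparkContext(conf=conf)

    # Load the ~20 GB input file; Spark splits it into partitions lazily.
    lines = sc.textFile("/data/input-20gb.txt")  # hypothetical path

    # Stand-in for "some processing" -- a simple narrow transformation.
    processed = lines.map(lambda line: line.strip().lower())

    # This action triggers the whole job and is where the failure surfaces.
    processed.saveAsTextFile("/data/output")  # hypothetical path
    sc.stop()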
If I manually extract a few thousand lines from the original input file and run the code on that, it works like a charm, producing the expected part-xxxxx files.
However, if I feed the entire 20 GB file in as input, it starts off fine but then hangs partway through; when I left it running overnight, it had failed by morning with the following message:
Py4JJavaError: An error occurred while calling o219.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does anyone know why this happens?