NullPointerException in Spark RDD map when submitted as a spark job

Asked: 2016-08-17 01:22:31

Tags: scala hadoop apache-spark distributed-computing bigdata

We're trying to submit a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we're getting a rather cryptic NPE in EMR. Everything runs fine as a Scala program, so we're not sure what's causing the problem. Here's the stack trace:


18:02:55,271 ERROR Utils:91 - Aborting task
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

As far as we can tell, this is happening in the following method:

def process(dataFrame: DataFrame, S3bucket: String) = {
  dataFrame.map(row =>
      "text|label"
  ).coalesce(1).write.mode(SaveMode.Overwrite).text(S3bucket)
}
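(As an aside: in Spark 2.0, DataFrame.map returns a Dataset[String] and needs an implicit Encoder in scope. A minimal sketch of the same write path with that import, assuming the "text|label" literal stands in for real row fields; the text and label column names here are hypothetical:)

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

def process(spark: SparkSession, dataFrame: DataFrame, s3Bucket: String): Unit = {
  import spark.implicits._ // provides the implicit Encoder[String] that map needs

  dataFrame
    .map(row => s"${row.getAs[String]("text")}|${row.getAs[String]("label")}") // hypothetical columns
    .coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .text(s3Bucket)
}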

We've narrowed it down to the map function, because this works when submitted as a Spark job:

def process(dataFrame: DataFrame, S3bucket: String) = {
  dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).text(S3bucket)
}

Does anyone know what might be causing this, and how we can fix it? We're stumped.

1 Answer:

Answer 0 (score: 6)

I think the NullPointerException is thrown by a worker when it tries to access a SparkContext object that's only present on the driver, not on the workers.
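As an illustration of that failure mode (a sketch, not necessarily what this particular job does): a driver-side handle that is not serialized with the closure comes back as null on the executors, and the first access to it throws there:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: @transient fields are skipped during closure serialization, so
// the executors deserialize `sc` as null and the first access NPEs there.
class Processor(spark: SparkSession) extends Serializable {
  @transient private val sc = spark.sparkContext // usable on the driver only

  def run(df: DataFrame): Long =
    df.rdd.map { row =>
      sc.defaultParallelism // NullPointerException on the worker: sc is null
      row.mkString("|")
    }.count()
}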

coalesce() repartitions your data. When you request a single partition, it will try to squeeze all of your data into that one partition. This can put a lot of pressure on your application's memory footprint.

In general, it's best not to shrink your data down to just one partition.
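If a single output file really is required, one common workaround (a sketch; writeText is an illustrative name) is repartition(1) instead of coalesce(1): the shuffle it introduces lets the upstream stages keep their parallelism, and only the final write runs on one partition:

import org.apache.spark.sql.{Dataset, SaveMode}

def writeText(lines: Dataset[String], s3Bucket: String): Unit = {
  // coalesce(1) can collapse the whole upstream pipeline onto a single task,
  // while repartition(1) shuffles first, so earlier stages stay parallel and
  // only the final write happens in one partition.
  lines.repartition(1)
    .write
    .mode(SaveMode.Overwrite)
    .text(s3Bucket)
}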

For more details, see: Spark NullPointerException with saveAsTextFile and this.