Spark java.lang.NullPointerException when filtering a Spark DataFrame inside a foreach iterator

Asked: 2018-12-02 05:35:57

Tags: java scala apache-spark

I have two Spark DataFrames. I want to iterate over one of them with a foreach and, for each row, fetch the records from the other DataFrame that match its someId.

Every time I execute this I get a java.lang.NullPointerException.

I have posted the code below, with comments inside the foreach loop. I tried three ways to do this, but the same error occurred each time.

Please help me resolve this issue.

val schListDf = spark.read.format("csv")
  .option("header", "true")
  .load("/home/user/projects/scheduled.csv")

schListDf.createOrReplaceTempView("scheduled")

val trsListDf = spark.read.format("csv")
  .option("header", "true")
  .load("/home/user/projects/transaction.csv")

trsListDf.createOrReplaceTempView("transaction")

// THIS WORKS FINE

val df3 = spark.sql("select * from transaction limit 5").show()

schListDf.foreach(row => {
  if (row(2) != null) {

    // I TRIED THIS WAY FIRST, BUT THE SAME ERROR OCCURRED
    val df = spark.sql("select * from transaction where someid = '" + row(2) + "'")

    // I TRIED THIS WAY SECOND (WITHOUT THE someid FILTER), BUT THE SAME ERROR OCCURRED
    val df2 = spark.sql("select * from transaction limit 5")

    // I ALSO TRIED THIS WAY (FILTERING THE DF DIRECTLY), BUT THE SAME ERROR OCCURRED
    val filteredDataListDf = trsListDf.filter($"someid" === row(2))
  }
})

  

18/12/02 10:36:34 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.NullPointerException
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:142)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:140)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
        at controllers.FileProcess$$anonfun$hnbFile$1.apply(FileProcess.scala:52)
        at controllers.FileProcess$$anonfun$hnbFile$1.apply(FileProcess.scala:48)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The same java.lang.NullPointerException stack trace is repeated for tasks 1.0 (TID 5), 2.0 (TID 6), and 3.0 (TID 7) in stage 4.0.

1 Answer:

Answer 0 (score: 0)

Certain aspects of Spark exist only on the driver.

A DataFrame cannot be accessed from inside a foreach: that closure runs on the executor side.

That is the paradigm, and the same applies to RDDs and the SparkSession: they live on the driver only. The SparkSession captured in your closure is effectively null on the executors, which is exactly why sessionState throws the NullPointerException in your trace.

That is, foreach itself is fine, but inside it you cannot use a DataFrame val or spark.sql. You would need, for example, a loop on the driver instead (see the sketch below).
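A minimal sketch of that driver-side loop, assuming (as the question's row(2) implies) that the someid sits in the third column of scheduled.csv. Note that collect() is only safe if the scheduled list is small enough to fit in driver memory:

// Bring the scheduled rows to the driver first; the loop below then
// runs on the driver, where the SparkSession is valid.
val scheduledRows = schListDf.collect()

for (row <- scheduledRows if row(2) != null) {
  // spark.sql is now called on the driver, not inside an executor closure
  val matches = spark.sql(s"select * from transaction where someid = '${row(2)}'")
  matches.show()
}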

This is a common misconception that comes up when starting out with Spark.
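That said, the idiomatic fix here is not a loop at all but a join, which keeps the whole lookup inside Spark's distributed plan. A sketch, assuming the scheduled DataFrame's key column is also named someid (the question only accesses it positionally as row(2)):

import spark.implicits._

// An inner join replaces the per-row lookup: every scheduled row with
// a non-null someid is matched against the transaction records.
val joined = schListDf
  .filter($"someid".isNotNull)
  .join(trsListDf, Seq("someid"))

joined.show()

This pushes the null check and the matching into a single distributed query instead of issuing one query per row.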
