I have some code that reads a data file into a Spark dataframe, is supposed to filter out any rows with null values in a certain field, and writes out the result. When I run it in the spark-shell, everything works fine. When I package it into a jar and run it with spark-submit, I get NullPointerExceptions. I can't figure out why. For some reason the null values don't seem to be caught by my if condition, which is supposed to flag them. My code looks like this:
import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, udf}
import com.databricks.spark.avro._  // for sqlContext.avroFile and AvroSaver

object findAmount extends App {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  val sc = new SparkContext()
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val myFilter: ((String, String) => Integer) = (tDate: String, openDate: String) => {
    if (Option(openDate).isEmpty || Option(tDate).isEmpty) 99
    else {
      val startDt = Calendar.getInstance()
      startDt.setTime(dateFormat.parse(openDate))
      startDt.add(Calendar.MILLISECOND, -1)
      val endDt = Calendar.getInstance()
      val eventDate = Calendar.getInstance()
      eventDate.setTime(dateFormat.parse(tDate))
      if (eventDate.after(startDt) && eventDate.before(endDt)) 1
      else 0
    }
  }
  val myFunc = udf(myFilter)
  val df = sqlContext.avroFile("hdfs:/my/file", 20)
    .withColumn("code", myFunc(col("tDate"), col("openDate")))
    .filter(col("code") === 1)
  AvroSaver.save(df, "hdfs:/my/output/folder")
}
And here is my command line:
spark-submit --class com.myApp.findAmount --master yarn-client --num-executors 20 --executor-memory 1g --driver-memory 1g
Here is the relevant part of the error output (there are pages and pages of it):
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost task 22.3 in stage 0.0 (TID 70, server.name): java.lang.NullPointerException
at com.myApp.findAmount$$anonfun$2.apply(findAmount.scala:159)
at com.myApp.findAmount$$anonfun$2.apply(findAmount.scala:154)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:62)
at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:174)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:152)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:147)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I don't think this is a duplicate of the question mentioned above, because 1) it's in a different language, 2) the question is not what an NPE is but what is causing it, since every variable that could cause one has been explicitly declared, and 3) the question remains why it does not happen in one environment but does happen in another with exactly the same code.
Answer 0 (score: 4)
I see that your myFilter function closes over the variable dateFormat, which I don't see defined anywhere in the code you've posted. When the myFilter function executes on a remote executor, dateFormat may be null.
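One way to avoid that capture, sketched below against the same Spark 1.x / spark-avro API the question uses (the avro import, paths, and column names are copied from the question, not verified here): build the SimpleDateFormat inside the function so the closure captures nothing from the enclosing object, and use an explicit main method instead of extends App. Objects that extend scala.App initialise their vals through DelayedInit, and the Spark documentation warns that subclasses of scala.App may not work correctly, which fits dateFormat being null only when the code runs on remote executors via spark-submit.

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, udf}
import com.databricks.spark.avro._  // assumed: same spark-avro API as in the question

object findAmount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // The formatter is created inside the function, so the closure that ships
    // to the executors captures nothing from the enclosing object.
    // (SimpleDateFormat is also not thread-safe, so a per-call instance avoids
    // sharing one formatter across tasks.)
    val myFilter = (tDate: String, openDate: String) => {
      if (Option(openDate).isEmpty || Option(tDate).isEmpty) 99
      else {
        val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
        val startDt = Calendar.getInstance()
        startDt.setTime(dateFormat.parse(openDate))
        startDt.add(Calendar.MILLISECOND, -1)
        val endDt = Calendar.getInstance()
        val eventDate = Calendar.getInstance()
        eventDate.setTime(dateFormat.parse(tDate))
        if (eventDate.after(startDt) && eventDate.before(endDt)) 1 else 0
      }
    }

    val myFunc = udf(myFilter)
    val df = sqlContext.avroFile("hdfs:/my/file", 20)
      .withColumn("code", myFunc(col("tDate"), col("openDate")))
      .filter(col("code") === 1)
    AvroSaver.save(df, "hdfs:/my/output/folder")
  }
}

Equivalently, you could keep extends App and just declare dateFormat inside myFilter; the key point is that nothing the closure needs should depend on the App body having run on the executor.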