Spark job fails due to java.io.NotSerializableException: org.apache.spark.SparkContext

Date: 2014-05-12 09:34:08

Tags: java scala hadoop apache-spark

I am hitting an exception when I try to apply a method (computeDwt) to an input of type RDD[(Int, ArrayBuffer[(Int, Double)])]. I have even made the class extend Serialization to serialize the objects in Spark. Here is the code snippet.

input: series: RDD[(Int, ArrayBuffer[(Int, Double)])]
DWTsample extends Serialization and is a class that provides the computeDwt method.
sc: SparkContext

val kk: RDD[(Int, List[Double])] = series.map(t => (t._1, new DWTsample().computeDwt(sc, t._2)))

Error:
org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext
org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:503)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:361)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

Could someone tell me what the problem might be and what should be done to fix it?

1 Answer:

Answer 0 (score: 15)

The line

series.map(t=>(t._1,new DWTsample().computeDwt(sc,t._2)))

references the SparkContext (sc), but SparkContext is not serializable. SparkContext is designed to expose operations that run on the driver; it cannot be referenced or used by code that runs on the workers. To ship the map function to the executors, Spark must serialize its closure, and because that closure captures sc, serialization fails with the NotSerializableException shown above.

You will have to restructure your code so that sc is not referenced from the map closure; one possible restructuring is sketched below.
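
For instance, here is a minimal sketch of that restructuring, assuming computeDwt only needs the per-record data and therefore does not need a SparkContext parameter at all. The body of computeDwt is not shown in the question, so the transform below is a hypothetical placeholder rather than a real DWT, and series is the input RDD from the question.

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Serializable is the standard Java marker trait (presumably what the
// question's "Serialization" refers to). The class holds no SparkContext
// field or parameter, so the map closure below captures nothing
// unserializable.
class DWTsample extends Serializable {
  // Operates purely on the local ArrayBuffer for a single record.
  def computeDwt(data: ArrayBuffer[(Int, Double)]): List[Double] =
    data.map(_._2).toList // hypothetical placeholder, not a real DWT
}

val kk: RDD[(Int, List[Double])] =
  series.map { case (id, buf) => (id, new DWTsample().computeDwt(buf)) }

Note that even if SparkContext were serializable, passing it in would not help: Spark does not support launching jobs or creating RDDs from code that is already running inside a transformation on an executor, so computeDwt could not make use of sc there in any case.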