Spark serialization error

Time: 2015-10-21 06:07:17

Tags: apache-spark

I have a problem with some code I inherited:

for ((catAttribs, (dataIter, queryIter)) <- localCollection) {
  println("in the loop")
  val bCastData = dataIter
  val spQueries = sc.parallelize(queryIter.toSeq, numReducers)
  val a = spQueries.count
  println("the value of a is: ")
  println(a)
  val type3AboveResult = spQueries.mapPartitions(queryPartitionIter =>
    extractType3Results(bCastData, queryPartitionIter.toIterable, kLikelihood)).flatMap(x => x)
}

println("up to here now")

The error message I get is:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
    at org.dave.examples.NN$$anonfun$main$2.apply(NN.scala:374)
    at org.dave.examples.NN$$anonfun$main$2.apply(NN.scala:367)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.dave.examples.NN$.main(NN.scala:367)
    at org.dave.examples.NN.main(NN.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@5fe3ed23)
    - field (class: org.dave.examples.NN$$anonfun$main$2, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class org.dave.examples.NN$$anonfun$main$2, <function1>)
    - field (class: org.dave.examples.NN$$anonfun$main$2$$anonfun$25, name: $outer, type: class org.dave.examples.NN$$anonfun$main$2)
    - object (class org.dave.examples.NN$$anonfun$main$2$$anonfun$25, <function1>)
    - field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$3, type: interface scala.Function1)
    - object (class org.apache.spark.rdd.RDD$$anonfun$14, <function3>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 20 more
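
Reading the serialization stack, my understanding is that the body of the for loop over localCollection is compiled to a closure (NN$$anonfun$main$2) that captures sc as the field sc$1 for the parallelize call, and the function I pass to mapPartitions keeps an $outer reference back to that loop closure (it needs it to reach extractType3Results), so Spark ends up trying to serialize the SparkContext itself. Below is a minimal sketch of what I believe is the same failure mode, with entirely hypothetical names (it is not my actual code):

import org.apache.spark.SparkContext

// Hypothetical minimal reproduction: a non-serializable class that holds sc.
class Driver(sc: SparkContext) {
  def addOne(n: Int): Int = n + 1 // instance method

  def run(): Long = {
    val rdd = sc.parallelize(1 to 10)
    // `it.map(addOne)` means `it.map(n => this.addOne(n))`: the task closure
    // captures `this`, and `this` drags sc along with it, producing
    //   org.apache.spark.SparkException: Task not serializable
    //   Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
    rdd.mapPartitions(it => it.map(addOne)).count()
  }
}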

The code runs if I exclude the following line:

val type3AboveResult = spQueries.mapPartitions(queryPartitionIter => extractType3Results(bCastData, queryPartitionIter.toIterable, kLikelihood)).flatMap(x => x)

This works in local mode but not in cluster mode. Any help is appreciated. I have added some code below that shows the structure of the program.

object NN{
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Nearest Neighbours")
    val sc = new SparkContext(conf)

    ... <excluded non-relevant code>

    class Person(numerAttribs: Array[Double], tfn: String, targetAttribs: Array[Double]) extends Serializable {

    ... <excluded non-relevant code>

      def extractType3Results(dataIter: Iterable[Person], queryIter: Iterable[Person], kInner: Int): Iterator[List[List[Person]]] = {
        val dataA = dataIter.map(person => (person.toTupleNumeric, person)).toArray
        val kdMap = KDTreeMap.fromSeq(dataA)
        var resultList = List[List[Person]]()
        val qIter = queryIter.iterator
        while (qIter.hasNext) {
          val curQuery = qIter.next
          // we are looking for kInner + 1 since the data index may contain the query object
          val kNNresult = kdMap.findNearest(curQuery.toTupleNumeric, kInner + 1)
          val filteredResult = kNNresult.filter { case (attribs, person) => person.TFN != curQuery.TFN }
          if (filteredResult.size == kInner) {
            resultList = (curQuery :: filteredResult.toList.map { case (attribs, person) => person }) :: resultList
          } else {
            resultList = (curQuery :: filteredResult.toList.map { case (attribs, person) => person }.sortWith((a, b) => a.distComp(b, curQuery)).take(kInner)) :: resultList
          }
        }
        Iterator.single(resultList)
      } // end extractType3Results

    } // end class Person

    val rawData = sc.textFile("hdfs://......W_org.csv", 150).map(_.split(" ")).cache()
    val dataRDD = rawData.map(a => (getCategoricalAttributes(a, indexes(0)), new Person(getStdDoubleAttributes(a, indexes(1), attsMean, attsStdev), a(0), getDoubleAttributes(a, indexes(2))))).coalesce(150)

    val queryRDD = dataRDD.cache()
    val coGrouped = dataRDD.cogroup(queryRDD, numMappers).cache
    val type1ResultCogroups = coGrouped.filter { case (catAttribs, (dataIter, queryIter)) => (dataIter.count(x => true) >= peerCodeThreshold) && (dataIter.count(x => true) <= kLikelihood) }
    val type2ResultCogroups = coGrouped.filter { case (catAttribs, (dataIter, queryIter)) => (dataIter.count(x => true) > kLikelihood) && (dataIter.count(x => true) <= linearSearchThreshold) }
    val type3BelowResultCogroups = coGrouped.filter { case (catAttribs, (dataIter, queryIter)) => (dataIter.count(x => true) > Math.max(kLikelihood, linearSearchThreshold)) && (queryIter.count(x => true) <= querySplitThreshold) }
    val type3AboveResultCogroups = coGrouped.filter { case (catAttribs, (dataIter, queryIter)) => (dataIter.count(x => true) > Math.max(kLikelihood, linearSearchThreshold)) && (queryIter.count(x => true) > querySplitThreshold) }

    val type1Result = type1ResultCogroups.flatMap { case (catAttribs, (dataIter, queryIter)) => extractType1Results(dataIter, queryIter).flatten }
    val type2Result = type2ResultCogroups.flatMap { case (catAttribs, (dataIter, queryIter)) => extractType2Results(dataIter, queryIter, kLikelihood).flatten }
    val type3BelowResult = type3BelowResultCogroups.flatMap { case (catAttribs, (dataIter, queryIter)) => extractType3Results(dataIter, queryIter, kLikelihood).flatten }
    var finalResult = type3BelowResult.union(type2Result).union(type1Result).coalesce(numReducers)
    val localCollection = type3AboveResultCogroups.collect

    for ((catAttribs, (dataIter, queryIter)) <- localCollection) {
      val bCastData = dataIter // sc.broadcast(dataIter)
      val spQueries = sc.parallelize(queryIter.toSeq, numReducers)
      val type3AboveResult = spQueries.mapPartitions(queryPartitionIter => extractType3Results(bCastData, queryPartitionIter.toIterable, kLikelihood)).flatMap(x => x)
      finalResult = finalResult.union(type3AboveResult)
    }
    val finalResult1 = finalResult.cache()

  } // end main
} // end object
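
One workaround I am considering (just a sketch with a hypothetical NNHelpers object, assuming Person is also moved to the top level; I have not verified it against the full program) is to hoist extractType3Results out of main into a standalone serializable object and copy everything the task needs into plain local vals, so the mapPartitions function no longer references the scope that holds sc:

object NNHelpers extends Serializable {
  // Same body as extractType3Results above, hoisted out of main so that calling
  // it does not require a reference to the enclosing (sc-holding) scope.
  def extractType3Results(dataIter: Iterable[Person], queryIter: Iterable[Person], kInner: Int): Iterator[List[List[Person]]] = {
    // ... body unchanged ...
  }
}

// Inside main: bind everything the task uses to local vals before building the closure.
for ((catAttribs, (dataIter, queryIter)) <- localCollection) {
  val localData = dataIter // Iterable[Person]; Person must be Serializable
  val localK = kLikelihood // plain Int copy, captured by value
  val spQueries = sc.parallelize(queryIter.toSeq, numReducers)
  val type3AboveResult = spQueries
    .mapPartitions(q => NNHelpers.extractType3Results(localData, q.toIterable, localK))
    .flatMap(x => x)
  finalResult = finalResult.union(type3AboveResult)
}

Defining Person inside main may have a similar capturing effect, since an inner class can also pull the enclosing scope into the serialized object graph, so moving it to the top level next to NNHelpers probably helps as well.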

0 Answers:

No answers yet.