Does using functions inside a transformation cause a "Task not serializable" exception?

Date: 2018-06-29 21:57:24

Tags: scala function apache-spark matrix serializable

I have a Breeze DenseMatrix. I compute the per-row mean and the per-row mean of squares, and put them into another DenseMatrix, one column for each. But I get a Task not serializable exception. I know that sc is not Serializable, but I think the exception occurs because I call functions inside the transformation in SafeZones.

Am I right? If so, how can this be done without calling any functions? Any help would be great!

Code:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, _}
import breeze.numerics.pow
import breeze.stats.mean
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object MotitorDetection {
  case class MonDetect() extends Serializable {

    var sc: SparkContext = _
    var machines: Int = 0
    var counters: Int = 0
    var GlobalVec = BDM.zeros[Double](counters, 2)

    // per-row mean of a matrix
    def findMean(a: BDM[Double]): BDV[Double] = {
      val c = mean(a(*, ::))
      c
    }

    // pack two vectors into a C x 2 matrix, one per column
    def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double] = {
      val m = BDM.zeros[Double](C, 2)
      m(::, 0) := x
      m(::, 1) := y
      m
    }

    def SafeZones(stream: DStream[(Int, BDM[Double])]) {
      stream.foreachRDD { (rdd: RDD[(Int, BDM[Double])], _) =>
        if (!rdd.isEmpty()) {
          val InputVec = rdd.map(x => (x._1, toMatrix(findMean(x._2), findMean(pow(x._2, 2)), counters)))
          GlobalMeanVector(InputVec) // defined elsewhere in the class (not shown)
        }
      }
    }
    // ... (rest of the class)
  }
}

Exception:

org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.map(RDD.scala:369)
        at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:85)
        at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:82)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
        - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6eee7027)
        - field (class: ScalaApps.MotitorDetection$MonDetect, name: sc, type: class org.apache.spark.SparkContext)
        - object (class ScalaApps.MotitorDetection$MonDetect, MonDetect())
        - field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect)
        - object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, <function2>)
        - field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1)
        - object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
        ... 28 more

1 Answer:

Answer 0 (score: 0)

findMean is an instance method of the MonDetect class nested in object MotitorDetection, and that instance carries a SparkContext field (sc), which is not serializable. Calling findMean inside rdd.map forces the closure to capture the whole MonDetect instance, so the task used in rdd.map cannot be serialized.
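
To see the mechanism, here is a minimal, self-contained sketch (hypothetical names, not the asker's code): an instance method called inside rdd.map compiles to this.method(...), so the closure captures the enclosing instance, and a non-serializable SparkContext field makes serialization fail exactly as in the stack trace above.

import org.apache.spark.{SparkConf, SparkContext}

class Holder(val sc: SparkContext) extends Serializable {
  def double(x: Int): Int = x * 2

  def run(): Unit = {
    val rdd = sc.parallelize(1 to 10)
    // map(x => double(x)) is really map(x => this.double(x)):
    // the closure captures `this`, Spark tries to serialize the whole
    // Holder instance, and the SparkContext field makes that fail with
    // "Task not serializable".
    rdd.map(x => double(x)).collect()
  }
}

object ClosureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo").setMaster("local[*]"))
    try new Holder(sc).run()   // throws org.apache.spark.SparkException: Task not serializable
    finally sc.stop()
  }
}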

Move all the matrix-related functions into a separate serializable object, say MatrixUtils, for example:

object MatrixUtils {
  def findMean(a: BDM[Double]): BDV[Double] = {
    var c = mean(a(*, ::))
    c
  }

  def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double]={
    val m = BDM.zeros[Double](C,2)
    m(::, 0) := x
    m(::, 1) := y
    m
  }

  ...
}

and then use only those methods inside rdd.map(...):

object MotitorDetection {
  val sc = ...

  def SafeZones(stream: DStream[(Int, BDM[Double])]){
    import MatrixUtils._

    ... = rdd.map( ... )

  }
}
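
Applied to the code in the question, SafeZones would then look roughly like this (a sketch, not tested; it assumes SafeZones stays a method of MonDetect and uses the imports shown in the question). Note that counters is a field of MonDetect, so reading it directly inside the lambda would capture the instance again; copying it into a local val first avoids that:

def SafeZones(stream: DStream[(Int, BDM[Double])]) {
  import MatrixUtils._

  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val c = counters                       // local copy: the closure captures an Int, not `this`
      val inputVec = rdd.map { case (id, m) =>
        (id, toMatrix(findMean(m), findMean(pow(m, 2)), c))
      }
      GlobalMeanVector(inputVec)             // as in the question; runs on the driver
    }
  }
}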