Task serialization problem when using a SparkML prediction model

Date: 2017-03-23 12:06:12

Tags: scala apache-spark spark-streaming apache-spark-mllib

I get a task serialization error when running this code, where myDstream is a DStream[String] and session is a String:

      val model = GradientBoostedTreesModel.load(sc,mySet.value("modelAddress") + mySet.value("modelId"))
      val newDstream = myDstream.map(session => {
        val features : Array[String] = UtilsPredictor.getFeatures()
        val parsedSession = UtilsPredictor.parseJSON(session)
        var input: String = ""
        var count: Integer = 1
        for (i <- 0 until features.length) {
          if (count < features.length) {
            input += parsedSession(features(i)) + ","
            count += 1
          }
          else {
            input += parsedSession(features(i))
          }
        }
        input = "[" + input + "]"
        val vecTest = Vectors.parse(input)
        parsedSession + ("prediction_result" -> model.predict(vecTest).toString)
      })


      newDstream.foreachRDD(session => {
        session.foreachPartition({ partitionOfRecords =>
            //...
        })
      })

The object UtilsPredictor is serializable, and the problem involves the use of the prediction model. Strangest of all, the serialization error is triggered by the line newDstream.foreachRDD(session => {. Any idea how to avoid this error?
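
One commonly suggested workaround, not part of the original post and assuming the captured model reference is what drags a non-serializable enclosing object into the closure, is to broadcast the loaded model and read it through the broadcast handle inside the map:

    // Hedged sketch: sc, mySet, myDstream, and UtilsPredictor are the values
    // from the question. Broadcasting the model means the closure captures
    // only the broadcast handle, not a field of the enclosing class.
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

    val model = GradientBoostedTreesModel.load(
      sc, mySet.value("modelAddress") + mySet.value("modelId"))
    val modelBroadcast = sc.broadcast(model)

    val newDstream = myDstream.map { session =>
      val parsedSession = UtilsPredictor.parseJSON(session)
      // builds the same "[f1,f2,...]" string as the original loop
      val input = UtilsPredictor.getFeatures()
        .map(parsedSession)
        .mkString("[", ",", "]")
      val vecTest = Vectors.parse(input)
      parsedSession + ("prediction_result" -> modelBroadcast.value.predict(vecTest).toString)
    }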

Update:

I tried @transient val vecTest = Vectors.parse(input), but I get the same task serialization error again. The error message is below. In particular, the error is triggered at Predictor.scala:234, which is the line session.foreachPartition({ partitionOfRecords =>.

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:234)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:233)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
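
The trace shows Spark's ClosureCleaner failing while checking the foreachPartition closure. A common cause of exactly this pattern, an assumption here since the surrounding Predictor class is not shown, is that the closure refers to a field or method of the enclosing class, so Spark must serialize the whole Predictor instance. Copying the field into a local val before the closure is a standard fix:

    // Hedged sketch: `model` is assumed to be a field of the enclosing
    // Predictor class. A local val copy keeps `this` out of the closure,
    // so only the model itself must be serializable.
    val localModel = model

    newDstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        partitionOfRecords.foreach { record =>
          // ... use localModel here instead of the class field ...
        }
      }
    }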

1 Answer:

Answer 0 (score: 0):

  • Make sure your class extends Serializable.
  • Add @transient to the fields you suspect are triggering the task serialization error. This annotation excludes the marked field from serialization (a combined sketch follows the logging example below).

Typically, this is what we do for logging in our applications, as shown below:

 @transient private lazy val log = LoggerFactory.getLogger(getClass)
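
Put together, a minimal sketch of both points (hypothetical class name, SLF4J logger as in the snippet above):

    import org.slf4j.LoggerFactory

    // The class itself is Serializable, so Spark can ship instances to
    // executors; the logger is @transient lazy, so it is skipped during
    // serialization and re-created on first use wherever the instance
    // ends up.
    class Predictor extends Serializable {
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      def run(): Unit = {
        log.info("logger created locally on driver or executor")
      }
    }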