I need to use a previously trained machine learning model to make predictions. However, the prediction has to happen inside foreachRDD, because the input data vecTest is produced by passing each record through various transformations and if-then rules. To avoid serialization problems I tried using a broadcast variable. My code is below, but I still get a serialization error. Any help is very welcome.
val model = GradientBoostedTreesModel.load(sc, pathToModel)
val model_sc = sc.broadcast(model)

myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition({ partitionOfRecords =>
    // ...
    val prediction_result = model_sc.value.predict(vecTest)
  })
})
Update:
I tried Kryo serialization, but still without success.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[GradientBoostedTreesModel]))
Update 2:
If I run the following code, I get an error (see the stacktrace below):
myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition({ partitionOfRecords =>
    val model = GradientBoostedTreesModel.load(sc, pathToModel)
    partitionOfRecords.foreach(s => {
      // ...
      val vecTestRDD = sc.parallelize(Seq(vecTest))
      val prediction_result = model.predict(vecTestRDD)
    })
  })
})
17/03/17 13:11:00 ERROR JobScheduler: Error running job streaming job 1489752660000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:210)
at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:209)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.test.classifier.Predictor
Serialization stack:
- object not serializable (class: org.test.classifier.Predictor, value: org.test.classifier.Predictor@26e949f7)
- field (class: org.test.classifier.Predictor$$anonfun$run$2, name: $outer, type: class org.test.classifier.Predictor)
- object (class org.test.classifier.Predictor$$anonfun$run$2, <function1>)
- field (class: org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, name: $outer, type: class org.test.classifier.Predictor$$anonfun$run$2)
- object (class org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
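The serialization stack above points at the closure, not the model: the function passed to foreachPartition references a field or method of the enclosing Predictor class, so Scala captures the whole instance through the $outer reference, and Predictor itself is not serializable. A common workaround is to copy everything the closure needs into local vals just before the closure, so only those values are captured. Below is a minimal, Spark-free sketch of this capture behavior; the Predictor class here is a hypothetical stand-in for the real one, and serializes() mimics what Spark's closure cleaner does with a task closure:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for org.test.classifier.Predictor: not Serializable.
class Predictor {
  val pathToModel = "/models/gbt" // a field used inside the closures below

  // Referencing the field directly makes the lambda capture `this`
  // (the $outer seen in the serialization stack), so it cannot serialize.
  def capturesOuter: String => String =
    s => s + pathToModel

  // Copying the field into a local val first means only the String
  // is captured, and the closure serializes fine.
  def capturesLocalOnly: String => String = {
    val localPath = pathToModel
    s => s + localPath
  }
}

// Mimics serializing a task closure, as Spark's JavaSerializer does.
def serializes(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

Following this pattern, assigning the broadcast handle to a local val inside the method (e.g. `val localModel = model_sc`) before the foreachRDD call should keep the Predictor instance out of the serialized closure.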
Update 3:
I tried another approach, but ran into the same problem:
val model = GradientBoostedTreesModel.load(sc, mySet.value("modelAddress") + mySet.value("modelId"))

val new_dstream = myDStream.map(session => {
  val features: Array[String] = UtilsPredictor.getFeatures()
  val parsedSession = UtilsPredictor.parseJSON(session)
  var input: String = ""
  var count: Integer = 1
  for (i <- 0 until features.length) {
    if (count < features.length) {
      input += parsedSession(features(i)) + ","
      count += 1
    } else {
      input += parsedSession(features(i))
    }
  }
  input = "[" + input + "]"
  val vecTest = Vectors.parse(input)
  parsedSession + ("prediction_result" -> model.predict(vecTest).toString)
})
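As a side note, the feature-joining loop above can be collapsed into a single mkString call, which also removes the separate count variable. A small self-contained sketch, where parsedSession and features are hypothetical stand-ins for the values returned by UtilsPredictor:

```scala
// Hypothetical stand-ins for UtilsPredictor.parseJSON(session) / getFeatures().
val parsedSession = Map("f1" -> "1.0", "f2" -> "2.5", "f3" -> "0.0")
val features = Array("f1", "f2", "f3")

// Look up each feature value, join with commas, and wrap in brackets
// in one step; the result is the same "[...]" string the original loop
// builds for Vectors.parse.
val input = features.map(parsedSession).mkString("[", ",", "]")
// input == "[1.0,2.5,0.0]"
```

This works because a Scala Map is itself a function from key to value, so it can be passed directly to map.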