I'm getting a task serialization error when running this code, where myDstream is a DStream[String] and session is a String:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// load the trained model on the driver
val model = GradientBoostedTreesModel.load(sc, mySet.value("modelAddress") + mySet.value("modelId"))

val newDstream = myDstream.map(session => {
  val features: Array[String] = UtilsPredictor.getFeatures()
  val parsedSession = UtilsPredictor.parseJSON(session)
  // build a "[v1,v2,...]" string for Vectors.parse
  var input: String = ""
  var count: Int = 1
  for (i <- 0 until features.length) {
    if (count < features.length) {
      input += parsedSession(features(i)) + ","
      count += 1
    } else {
      input += parsedSession(features(i))
    }
  }
  input = "[" + input + "]"
  val vecTest = Vectors.parse(input)
  parsedSession + ("prediction_result" -> model.predict(vecTest).toString)
})

newDstream.foreachRDD(session => {
  session.foreachPartition({ partitionOfRecords =>
    //...
  })
})
The object UtilsPredictor is serializable, and the problem seems to involve the use of the prediction model. But the strangest part is that the serialization error is triggered by the line newDstream.foreachRDD(session => {. Any ideas how to avoid this error?
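For context, a frequent cause of this exact symptom is that the map closure references a field of the enclosing class (here apparently org.test.classifier.Predictor, going by the stack trace further down), so Spark tries to serialize the whole class. Below is a minimal sketch of the usual workaround, copying the field into a local val so the closure captures only the model; it assumes model is a class field and that parseJSON returns a Map[String, String]:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Copy the field into a local val: the closure then captures only the
// (serializable) model, not the enclosing Predictor instance.
val localModel: GradientBoostedTreesModel = model
val predictions = myDstream.map { session =>
  val parsedSession = UtilsPredictor.parseJSON(session)
  // Map[String, String] is a String => String function, so it can be mapped over the feature names
  val input = "[" + UtilsPredictor.getFeatures().map(parsedSession).mkString(",") + "]"
  parsedSession + ("prediction_result" -> localModel.predict(Vectors.parse(input)).toString)
}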
Update:
I tried @transient val vecTest = Vectors.parse(input), but got the same task serialization error again. The error message is below. In particular, the error is triggered at Predictor.scala:234, which is the line session.foreachPartition({ partitionOfRecords =>:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:234)
at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:233)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
Answer 0 (score: 0)
Make the enclosing class extend Serializable, or add @transient to the field you suspect is causing the task serialization error. This annotation skips the annotated entity when serialization is computed/considered. Typically, this is what we do for logging in an application, as shown below:
@transient private lazy val log = LoggerFactory.getLogger(getClass)
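Applied to the code above, a minimal sketch might look like this (the class name Predictor and the run method come from the stack trace; the constructor and the rest are illustrative assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.slf4j.LoggerFactory

// Serializable lets Spark ship instances of the class inside closures;
// @transient excludes fields that cannot (or need not) be serialized.
class Predictor(@transient val sc: SparkContext) extends Serializable {
  // recreated lazily on each executor instead of being serialized and shipped
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def run(modelPath: String): Unit = {
    val model = GradientBoostedTreesModel.load(sc, modelPath)
    log.info("Loaded model from " + modelPath)
    // ... the DStream transformations from the question go here ...
  }
}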