State management not serializable

Date: 2017-01-04 09:14:33

Tags: scala apache-spark streaming state

In my application I want to keep track of multiple states, so I tried to encapsulate the whole state management logic in a class StateManager, as follows:

@SerialVersionUID(xxxxxxxL)
class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {
  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
  lazy val stateSpec = StateSpec
    .function(trackStateFunc _)
    .initialState(initialState)
    .timeout(Seconds(30))
  def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {
    // state-update logic elided
    None
  }
}

object StateManager {
  def apply(dstream: DStream[(String, String)], initialstate: RDD[(String, String)]) =
    new StateManager(dstream, initialstate)
}

The @SerialVersionUID(xxxxxxxL) ... extends Serializable is an attempt to solve my problem.

But when I call StateManager from my main class, like this:

val lStreamingEnvironment = StreamingEnvironment(streamingWindow, checkpointDirectory)
val stateManager = StateManager(lStreamingEnvironment.sparkContext, 1, None)
val state = stateManager.state(lKafkaStream)

state.foreachRDD(_.foreach(println))

(see StreamingEnvironment below), I get:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.

The error is clear enough, but I still can't pin down where it is being triggered.

Where is it triggered? What can I do to fix this and end up with a reusable class?

The StreamingEnvironment class, which may be relevant:

class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
  val sparkContext = SparkContext.getOrCreate(mySparkConf)
  lazy val streamingContext = new StreamingContext(sparkContext, myStreamingWindow)

  streamingContext.checkpoint(myCheckPointDirectory)
  streamingContext.remember(Minutes(1))

  def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}

object StreamingEnvironment {
  def apply(streamingWindow: Duration, checkpointDirectory: String) = {
    //setup sparkConf and kafkaConf

    new StreamingEnvironment(sparkConf, kafkaConf, streamingWindow, checkpointDirectory)
  }
}

1 Answer:

Answer 0 (score: 0)

When we lift a method into a function, as happens here with function(trackStateFunc _), the outer reference to the parent class becomes part of that function's closure. Declaring trackStateFunc directly as a function (i.e. a val) will probably solve the problem.
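
A minimal sketch of that refactoring (the function body is a placeholder and the rest of the class is taken from the question):

class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {

  // A function value does not close over the enclosing instance,
  // so passing it to StateSpec.function does not drag `this` along.
  val trackStateFunc: (String, Option[String], State[String]) => Option[(String, String)] =
    (key, value, state) => {
      // state-update logic elided, as in the question
      None
    }

  lazy val stateSpec = StateSpec
    .function(trackStateFunc)
    .initialState(initialState)
    .timeout(Seconds(30))

  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
}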

Also note that marking a class Serializable does not magically make it so. A DStream is not serializable, so it should be annotated @transient, which may also solve the problem.
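
A sketch of that second suggestion, assuming the stream and the initial-state RDD are only needed on the driver while the streaming graph is built:

class StateManager(
    @transient private val inputStream: DStream[(String, String)],
    @transient private val initialState: RDD[(String, String)]
) extends Serializable {
  // @transient keeps these driver-side references out of any serialized closure;
  // the rest of the class stays as above.
}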