Spark Streaming: sharing state between two streams

Asked: 2016-04-08 10:06:42

Tags: apache-spark spark-streaming

Can Spark Streaming state be shared between two DStreams?

Basically, I want to create/update state from the first stream and use that state to enrich the second stream.

Example: I modified the StatefulNetworkWordCount example. I build the state from the first stream and enrich the second stream with the counts from the first stream.

val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))


// Mapping function for the first stream: adds the incoming count to the
// stored count, updates the state, and emits (word, newSum).
val mappingFuncForFirstStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  val output = (word, sum)
  state.update(sum)

  Some(output)
}

// Mapping function for the second stream: only reads the stored count
// and emits (word, currentSum) without updating the state.
val mappingFuncForSecondStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
  val sum = state.getOption.getOrElse(0)
  val output = (word, sum)

  Some(output)
}



// First stream: builds and updates the word-count state.
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
  .flatMap(r => r._2.split(" "))
  .map(x => (x, 1))
  .mapWithState(StateSpec.function(mappingFuncForFirstStream).initialState(initialRDD).timeout(Minutes(10)))
  .print(1)



// Second stream: meant to be enriched with the counts from the first stream.
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams2, mergeTopicSet)
  .flatMap(r => r._2.split(" "))
  .map(x => (x, 1))
  .mapWithState(StateSpec.function(mappingFuncForSecondStream).initialState(initialRDD).timeout(Minutes(10)))
  .print(50)

In the checkpoint directory I can see two different state RDDs, i.e. each mapWithState call maintains its own separate state.

I am using spark-1.6.1 and kafka-0.8.2.1.

2 Answers:

Answer 0 (score: 2)

The underlying state of the DStream produced by applying the mapWithState operation can be accessed as a DStream of (key, state) pairs through stateMappedDStream.stateSnapshots().

So, inspired by your example:

val firstDStream = ???
val secondDStream = ???
val firstDStreamSMapped = firstDStream.mapWithState(...)
// stateSnapshots() exposes the current (key, state) pairs as a DStream.
val firstStreamState = firstDStreamSMapped.stateSnapshots()
// We want to use the state of stream 1 to enrich stream 2.
// The keys of both streams are required to match.
val enrichedStream = secondDStream.join(firstStreamState)
... do stuff with enrichedStream ...
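
To make this concrete, below is a minimal, self-contained sketch of the same pattern. The socket sources, port numbers, checkpoint path, and batch interval are illustrative assumptions, not part of the original answer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("SharedStateSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/tmp/shared-state-checkpoint") // mapWithState requires checkpointing

// Stream 1 builds the state: a running count per word.
val firstDStream = ssc.socketTextStream("localhost", 9998)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

val countSpec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
})

// stateSnapshots() re-publishes the full (word, count) state on every batch.
val firstStreamState = firstDStream.mapWithState(countSpec).stateSnapshots()

// Stream 2 is enriched by a per-batch join against that state snapshot.
val secondDStream = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

val enrichedStream = secondDStream.join(firstStreamState) // DStream[(String, (Int, Int))]
enrichedStream.print()

ssc.start()
ssc.awaitTermination()

Since this is an inner join, words from stream 2 that have no accumulated state yet are dropped; a leftOuterJoin would keep them with None instead.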

Answer 1 (score: -1)

  

This method may help you:

    ssc.union(Seq[DStream[T]])
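
One way to read this hint (my interpretation; the answer does not spell it out) is to tag the elements of each stream, union them, and run a single mapWithState over the merged stream, so both streams feed one shared state. The "first"/"second" tags and the stream names below are hypothetical:

// Tag each element so the mapping function can tell the streams apart.
val tagged1 = firstDStream.map { case (word, n) => (word, ("first", n)) }
val tagged2 = secondDStream.map { case (word, n) => (word, ("second", n)) }
val merged = ssc.union(Seq(tagged1, tagged2))

val sharedMappingFunc = (batchTime: Time, word: String, value: Option[(String, Int)], state: State[Int]) => {
  value match {
    case Some(("first", n)) =>
      // Element from stream 1: update the running count, emit nothing.
      state.update(n + state.getOption.getOrElse(0))
      None
    case _ =>
      // Element from stream 2: read the current count without updating it.
      Some((word, state.getOption.getOrElse(0)))
  }
}

val enrichedStream = merged.mapWithState(StateSpec.function(sharedMappingFunc))

Note that within a batch Spark does not guarantee an ordering between the two sources, so an element from stream 2 may see the state either before or after that batch's updates from stream 1.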