How to achieve exactly-once semantics with Spark Streaming + Kafka integration

Time: 2017-07-08 12:26:34

Tags: apache-spark spark-streaming

I have a question. There is a guide on how to achieve exactly-once semantics; here is the code: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))
 //=====================================================
 //separate line
 //=====================================================
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

However, what if I want to use `reduceByKeyAndWindow` in the separate part, like this:

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)


val lines: DStream[String] = stream.map(record => record.value)
lines.map(row => {
  (row.split(",")(1), 1)
}).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(5))
  .foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    //mycode start
    rdd.foreach(println)
    //mycode end
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  })

I tried this, but I got an error:

Exception in thread "main" java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD cannot be cast to org.apache.spark.streaming.kafka010.HasOffsetRanges
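From what I can tell, the cast fails because `reduceByKeyAndWindow` shuffles the data, so the RDD arriving in `foreachRDD` is a `MapPartitionsRDD` produced by the shuffle, not the original `KafkaRDD` that implements `HasOffsetRanges`. One workaround I have seen (a sketch only, untested here; it assumes the same `stream`, `streamingContext`, and imports as above) is to capture the offset ranges in a `transform` on the raw stream, before any transformation breaks the cast:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

// Holds the offsets of the most recent batch; written inside transform,
// read later when the windowed output has completed.
var offsetRanges = Array.empty[OffsetRange]

val lines = stream.transform { rdd =>
  // rdd here is still the KafkaRDD, so the cast succeeds
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map(record => record.value)

lines.map(row => (row.split(",")(1), 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))
  .foreachRDD { rdd =>
    rdd.foreach(println)
    // commit the offsets captured on the raw stream, not from this shuffled RDD
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
```

Note that with overlapping windows a record can contribute to later windows after its offsets were already committed, so committing this way gives at-least-once delivery rather than true exactly-once unless the output side is idempotent or transactional.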

Any help? Thanks in advance!

0 Answers:

No answers yet