Issue storing offsets in Kafka for a Spark Streaming application

Asked: 2017-10-13 13:11:10

Tags: apache-kafka spark-streaming spark-streaming-kafka

In our cluster we have Kafka 0.10.1 and Spark 2.1.0. The Spark Streaming application works fine with the checkpoint mechanism (checkpoints on HDFS). However, we have noticed that with checkpointing, the Streaming application does not restart if the code is changed.
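For context, our checkpoint-based setup looks roughly like the sketch below (simplified; the application name, batch interval and checkpoint path are placeholders rather than our real values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingJob {
  // Checkpoint directory on HDFS (placeholder path)
  val checkpointDir = "hdfs:///user/streaming/checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StreamingWithCheckpoint")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // Kafka stream creation, transformations and output operations are defined here,
    // inside the factory function, so they can be restored from the checkpoint
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restores the context from the checkpoint if one exists, otherwise creates it;
    // restoring does not pick up code changes, which is the limitation described above
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}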

Exploring the Spark Streaming documentation on storing the offsets in Kafka itself:

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself, which says:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
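
For reference, the pattern above assumes Kafka consumer parameters along these lines (a minimal sketch modeled on the same documentation page, assuming an existing StreamingContext ssc; the broker list, group id and topic name are placeholders, not our actual KakfaConfigs):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092",   // placeholder broker list
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "our-streaming-app",                     // placeholder consumer group
  "auto.offset.reset" -> "latest",
  // offsets are committed explicitly via commitAsync, so auto-commit stays off
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("topicName"), kafkaParams)
)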

After this, I modified our code as below:

val kafkaMap:Map[String,Object] = KakfaConfigs

val stream:InputDStream[ConsumerRecord[String,String]] = KafkaUtil.createDirectStream(ssc, PreferConsistent, Subscribe[String,String] (Array("topicName"),kafkaMap))

stream.foreach { rdd =>
    val offsetRangers : Array[OffsetRanger] = rdd.asInstanceOf[HasOffsetRangers].offsetRanges

    // Filter out the values which have empty values and get the tuple of type
    // (topicname, stringValue_read_from_kafka_topic)
    stream.map(x => ("topicName",x.value)).filter(x=> !x._2.trim.isEmpty).foreachRDD(processRDD _)

    // Sometime later, after outputs have completed.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}


def processRDD(rdd:RDD[(String,String)]) {
  // Process further to HDFS
}

Now, when I try to start the Streaming application, it does not start, and this is what we see in the logs:

Can anyone suggest what we might be missing here?

0 Answers:

No answers yet.