In our cluster we have Kafka 0.10.1 and Spark 2.1.0. The Spark Streaming application works fine with the checkpointing mechanism (checkpoints on HDFS). However, we have noticed that with checkpointing, the Streaming application does not restart if there is a code change.
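For reference, our checkpoint-based setup follows the usual getOrCreate pattern, roughly as sketched below; the application name, checkpoint directory, and batch interval are placeholders rather than our real values.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Rough sketch of the checkpoint-based setup; path and batch interval are placeholders.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("StreamingApp")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint("hdfs:///user/streaming/checkpoints") // checkpoints on HDFS
  // ... build the Kafka DStream and output operations here ...
  ssc
}

// getOrCreate restores the serialized DStream graph from the checkpoint directory,
// which is why a restart fails once the application code has changed.
val ssc = StreamingContext.getOrCreate("hdfs:///user/streaming/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()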
Exploring the Spark Streaming documentation on storing offsets in Kafka itself:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
It says that you can commit offsets to Kafka yourself with the commitAsync API after your outputs have completed, and that, unlike checkpoints, Kafka remains a durable offset store regardless of changes to your application code.
After this, I modified our code as below:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
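For completeness, the consumer properties behind KakfaConfigs look roughly like the following; the broker address, group id and offset reset policy are placeholders, and auto-commit is disabled so that offsets are only stored when commitAsync is called.

import org.apache.kafka.common.serialization.StringDeserializer

// Sketch of the consumer configuration; broker list and group id are placeholders.
// enable.auto.commit is false so offsets are committed only via commitAsync.
val KakfaConfigs: Map[String, Object] = Map[String, Object](
  "bootstrap.servers"  -> "kafka-broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "streaming-app-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)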
Now, when I try to start the Streaming application, it does not start. This is the code we are running:
val kafkaMap: Map[String, Object] = KakfaConfigs
val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](Array("topicName"), kafkaMap))

stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Filter out the records that have empty values and map each one to a tuple of
  // (topicName, string_value_read_from_kafka_topic)
  stream.map(x => ("topicName", x.value)).filter(x => !x._2.trim.isEmpty).foreachRDD(processRDD _)

  // Sometime later, after outputs have completed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

def processRDD(rdd: RDD[(String, String)]) {
  // Process further to HDFS
}
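The real processRDD is more involved; as a simplified, hypothetical stand-in for what it does, it writes the tuples out to HDFS roughly along these lines (the output path is only an example):

import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the real processRDD; the HDFS output path is an example only.
def processRDD(rdd: RDD[(String, String)]): Unit = {
  rdd.map { case (topic, value) => s"$topic\t$value" }
     .saveAsTextFile(s"hdfs:///user/streaming/output/${System.currentTimeMillis}") // one directory per batch
}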
Can someone suggest if we are missing something here?