I am reading messages from Kafka in a Spark Streaming application.
Spark batch duration: 15 seconds. Spark window: 60 seconds.
var dstream = KafkaUtils.createDirectStream(); // ignore the arguments
var windowedStream = dstream.window(SparkWindow);
// delete data from REDIS
windowedStream.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        JavaFutureAction<Void> v = rdd.foreachPartitionAsync(t -> {
            // collect error data across partitions and write those to REDIS
        }); // foreachPartitionAsync ends
    }
});
// fetchFromREDISAndProcess() -- once foreachRDD ends, fetch error data from REDIS and process it
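For context, a minimal sketch of what this setup might look like with the spark-streaming-kafka-0-10 integration is shown below; the broker address, group id, and topic name are placeholders, and the Redis logic from the snippet above is elided.

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ErrorWindowJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-error-window");
        // 15-second batch interval, as described above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "error-window-group");        // placeholder
        Collection<String> topics = Arrays.asList("events");      // placeholder

        JavaInputDStream<ConsumerRecord<String, String>> dstream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        // 60-second window; the slide interval defaults to the batch interval (15 s)
        JavaDStream<ConsumerRecord<String, String>> windowedStream =
            dstream.window(Durations.seconds(60));

        // ... foreachRDD / Redis logic from the snippet above ...

        jssc.start();
        jssc.awaitTermination();
    }
}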
There is a catch here: I first have to collect the error records from every partition and every RDD within the Spark window, and then process them on the driver.
Since the window is 60 seconds and the batch interval is 15 seconds, I will get 4 RDDs in each Spark window.
Question: I want to read the data from REDIS after each window and process it before moving on to the next window. Is there a way to make sure my code runs every time a Spark window ends?
Answer 0 (score: 0)
You can use the following logic:
var dstream = KafkaUtils.createDirectStream(); // ignore the arguments
var windowedStream = dstream.window(SparkWindow);
// delete data from REDIS

int partitions = 4;                                // 60 s window / 15 s batches = 4 RDDs per window
AtomicInteger currentPart = new AtomicInteger(0);  // counter must stay mutable inside the lambda

windowedStream.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        JavaFutureAction<Void> v = rdd.foreachPartitionAsync(t -> {
            // collect error data for this partition and write it to REDIS
        }); // foreachPartitionAsync ends

        if (currentPart.incrementAndGet() % partitions == 0) {
            // true on every 4th RDD, i.e. when the current window ends:
            // read the data back from REDIS and process it here,
            // before the next window starts
        }
    }
});
// fetchFromREDISAndProcess() -- once foreachRDD ends, fetch error data from REDIS and process it
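One detail worth noting with this approach: foreachPartitionAsync only starts the Redis writes and returns a JavaFutureAction, so when the counter hits the 4th RDD the writes for that window may still be in flight. Below is a rough sketch of one way to close that gap; the pendingWrites list and the fetchFromRedisAndProcess helper are placeholder names, not part of the original answer.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.spark.api.java.JavaFutureAction;

// inside the driver program, after windowedStream has been created as above
int partitions = 4;                                  // 60 s window / 15 s batches
AtomicInteger seenRdds = new AtomicInteger(0);
List<JavaFutureAction<Void>> pendingWrites = new ArrayList<>();

windowedStream.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // start the asynchronous per-partition writes and remember the future
        pendingWrites.add(rdd.foreachPartitionAsync(records -> {
            // write the error records of this partition to REDIS (elided)
        }));
    }
    // count every RDD, empty or not, so the window boundary does not drift
    if (seenRdds.incrementAndGet() % partitions == 0) {
        // block until every asynchronous write started in this window has finished
        for (JavaFutureAction<Void> f : pendingWrites) {
            f.get();
        }
        pendingWrites.clear();
        // now it is safe to read the collected error data back from REDIS
        // fetchFromRedisAndProcess();   // hypothetical helper
    }
});

Since foreachRDD functions run sequentially on the driver, mutating the driver-local counter and list here is safe; the blocking get() calls simply delay the start of the next batch until the window's Redis writes are visible.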