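I set a batch interval of 5 seconds (Seconds(5)) and would like to add a tag to the data of each batch. If every batch's data carries such a tag, I can filter the data by tag when I use the window() function.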
Some data arrives during the first 5 seconds:
hello
word
hello
After tagging this data:
(1st, hello) // "1st" is the custom tag that can identify this batch data
(1st, word)
(1st, hello)
Some data arrives during the second 5 seconds:
spark
streaming
interval
time
After tagging this data:
(2nd, spark)
(2nd, streaming)
(2nd, interval)
(2nd, time)
Answer 0 (score: 1)
There are 3 options:
The last option is to use an Accumulator. Something like this:
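import org.apache.spark.SparkContext

val sc = new SparkContext(conf)
// sc.accumulator(0, ...) is deprecated since Spark 2.0; longAccumulator replaces it
val accum = sc.longAccumulator("My Accumulator")
val recDStream = // Write code to get the stream
recDStream.foreachRDD { rdd =>
  accum.add(1) // the foreachRDD body runs on the driver, so this counts once per batch
  val batchId = accum.value // read on the driver; identical for every record in the batch
  rdd.foreach(x => println("Data for Batch-" + batchId + "-" + x))
}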
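Since the question is about filtering by tag inside window(), it may also help to see a variant that needs no counter at all. The following is only a sketch under stated assumptions, not a tested production setup: it tags each record with the batch time that the two-argument form of DStream.transform passes in, so the tag is an epoch-millisecond value rather than "1st" or "2nd". The queueStream source, the local[2] master, the 10-second window, and the names BatchTagSketch, tagWith and hasTag are all made up for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object BatchTagSketch {
  // Hypothetical helper: pair a record with the tag of its batch
  def tagWith(batchMillis: Long)(record: String): (Long, String) = (batchMillis, record)

  // Hypothetical helper: true if the tagged record belongs to the given batch
  def hasTag(batchMillis: Long)(tagged: (Long, String)): Boolean = tagged._1 == batchMillis

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchTagSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in source (assumption): one queued RDD is consumed per 5-second batch
    val queue = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("hello", "word", "hello")),
      ssc.sparkContext.parallelize(Seq("spark", "streaming", "interval", "time"))
    )
    val lines = ssc.queueStream(queue, oneAtATime = true)

    // Tag every record with the time of the batch it arrived in
    val tagged = lines.transform { (rdd: RDD[String], time: Time) =>
      val ms = time.milliseconds
      rdd.map(record => tagWith(ms)(record))
    }

    // A 10-second window sliding every 5 seconds covers the last two batches
    val windowed = tagged.window(Seconds(10), Seconds(5))

    // Filter by tag: keep only the records of the newest batch in the window
    val newest = windowed.transform { (rdd: RDD[(Long, String)], time: Time) =>
      val ms = time.milliseconds
      rdd.filter(pair => hasTag(ms)(pair))
    }

    windowed.print()
    newest.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(20000) // stop the demo after about 20 seconds
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The batch time is the same for every record of a batch and differs between batches, which is all the tag needs to do. If a small sequential number is preferred, one possibility is to divide the difference between the batch time and the first batch's time by the batch interval.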