Question

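If I set a batch interval of 5 seconds (Seconds(5)), I want to add a tag to the current batch of data every 5 seconds. If I can tag the data of every batch, then when I use the window() function I can filter the data by that tag.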

In the first 5 seconds, some data is input:

hello
word
hello

After adding a tag to this data:

(1st, hello)     // "1st" is the custom tag that can identify this batch data
(1st, word)
(1st, hello)

In the second 5 seconds, some more data is input:

spark
streaming
interval
time

After adding a tag to this data:

(2nd, spark)
(2nd, streaming)
(2nd, interval)
(2nd, time)

Answer 1

There are 3 options:

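The best way is to add some identification to the messages themselves, so that when you receive them you already have something that identifies each message.
The second option is to create a custom receiver that can identify each batch of messages and add a tag before handing them to the Spark job.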
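
As a rough illustration of the custom receiver option, the sketch below reads lines from a socket and stores each one together with a tag. The class name `TaggingReceiver`, the helper `bucketTag`, and the host, port, and interval parameters are all hypothetical and not from the original answer. The tag is derived from the receiver's wall clock, so it only approximates Spark's batch boundaries and is not guaranteed to line up with them.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

object TaggingReceiver {
  // Hypothetical helper: maps a timestamp to the index of its time bucket.
  def bucketTag(timestampMs: Long, intervalMs: Long): String =
    "bucket-" + (timestampMs / intervalMs)
}

// Emits (tag, line) pairs; intervalMs is assumed to match the batch interval.
class TaggingReceiver(host: String, port: Int, intervalMs: Long)
  extends Receiver[(String, String)](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so that onStart returns immediately.
    new Thread("Tagging Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the receiving thread exits once isStopped() is true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      try {
        var line = reader.readLine()
        while (!isStopped() && line != null) {
          // Tag each record with the bucket it arrived in, then store it.
          store((TaggingReceiver.bucketTag(System.currentTimeMillis(), intervalMs), line))
          line = reader.readLine()
        }
      } finally {
        reader.close()
        socket.close()
      }
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    }
  }
}
```

With a StreamingContext named `ssc`, such a receiver could be plugged in with `ssc.receiverStream(new TaggingReceiver("localhost", 9999, 5000L))`, which would yield a DStream of (tag, line) pairs.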

The last option is to use an Accumulator. Something like this:

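import org.apache.spark.SparkContext

val sc = new SparkContext(conf)
val accum = sc.accumulator(0, "My Accumulator")
val recDStream = // write code to get the stream
// The foreachRDD body runs on the driver once per batch, so incrementing here
// (rather than per record) advances the counter once for the whole batch.
recDStream.foreachRDD { rdd =>
  accum += 1 // `+=` returns Unit, so read the value separately
  val tag = "Data for Batch-" + accum.value
  rdd.map(x => tag + "-" + x).foreach(println)
}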
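
Since the question is about filtering by tag after window(), here is a minimal sketch of a different technique from the accumulator: using the batch time that `transform` passes alongside each RDD as the tag. This is not part of the original answer. The object name `TagByBatchTime`, the helper `tagWith`, the socket source on localhost:9999, and the 15-second window are assumptions for illustration only.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object TagByBatchTime {
  // Pairs a record with the millisecond timestamp of the batch it belongs to.
  def tagWith(time: Time)(record: String): (Long, String) =
    (time.milliseconds, record)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TagByBatchTime")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: text lines from a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Tag every record with its batch time before windowing.
    val tagged = lines.transform { (rdd: RDD[String], time: Time) =>
      rdd.map(tagWith(time))
    }

    // A 15-second window sliding every 5 seconds holds up to three batches.
    val windowed = tagged.window(Seconds(15), Seconds(5))

    // Filter by tag: keep only records whose tag equals the window's own time,
    // which (with the slide equal to the batch interval) is the newest batch.
    val newest = windowed.transform { (rdd: RDD[(Long, String)], time: Time) =>
      val current = time.milliseconds
      rdd.filter { case (tag, _) => tag == current }
    }

    newest.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The batch time is already unique per batch and available on the driver, so no shared counter is needed. A readable label such as "1st" or "2nd" could be derived from it if required.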

How to add a tag to every batch of data in Spark Streaming?