Scala Spark Structured Streaming receiving duplicate messages

Time: 2020-07-07 04:53:51

标签: scala apache-spark apache-spark-sql spark-streaming spark-structured-streaming

I am using Kafka and Spark 2.4.5 Structured Streaming. I am computing an average, but I am running into issues because duplicate records arrive from the Kafka topic in the current batch.


For example, these Kafka topic messages are received in the 1st batch, running in update mode:

car,Brand=Honda,speed=110,1588569015000000000
car,Brand=ford,speed=90,1588569015000000000
car,Brand=Honda,speed=80,15885690150000000000

Here the result is the average per car brand per timestamp, i.e. grouping by timestamp 1588569015000000000 and Brand=Honda, the result we get is (110+90)/2 = 100.

Now the second batch receives late data, and the message is duplicated with the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000

I expect the average to update to (110+90+50)/3 = 83.33, but instead it updates to (110+90+50+50)/4 = 75, which is wrong.

val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1") // subscribe to the input topic
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as data")

Group by timestamp and Brand.

Write the result to Kafka with a checkpoint (a rough sketch of both steps follows below).
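
A rough sketch of those two steps, assuming the raw CSV payload is first split into brand, speed and ts columns; those column names, the output topic topic2 and the checkpoint path are illustrative placeholders rather than anything from the original code:

import org.apache.spark.sql.functions._

// Parse the raw CSV string ("car,Brand=Honda,speed=110,1588569015000000000")
// into typed columns; names are assumptions for illustration.
val parsed = rawDataStream
  .select(split(col("data"), ",").as("fields"))
  .select(
    regexp_replace(col("fields").getItem(1), "Brand=", "").as("brand"),
    regexp_replace(col("fields").getItem(2), "speed=", "").cast("double").as("speed"),
    col("fields").getItem(3).cast("long").as("ts"))

// Average speed per (timestamp, brand), as described above.
val averaged = parsed
  .groupBy(col("ts"), col("brand"))
  .agg(avg(col("speed")).as("avg_speed"))

// Write the running averages back to Kafka in update mode with a checkpoint;
// "topic2" and the checkpoint path are placeholders.
val query = averaged
  .selectExpr("CAST(ts AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", "topic2")
  .option("checkpointLocation", "/tmp/checkpoints/car-averages")
  .outputMode("update")
  .start()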

How should this be done with Spark Structured Streaming, or is there an error in my code?

1 Answer:

Answer 0 (score: 1)

Spark Structured Streaming lets you deduplicate a streaming DataFrame with dropDuplicates. You specify the fields that identify a duplicate record; across batches, Spark keeps only the first record for each combination of those values and discards later records with the same values.

The snippet below deduplicates the streaming DataFrame on the combination of Brand, speed, and timestamp.

rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
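
Note that rawDataStream as defined in the question only exposes the raw value string as a single data column, so the deduplication would run after parsing it into typed columns. A minimal sketch, assuming the brand, speed and ts columns from the parsing sketch in the question, plus an arbitrary 10-minute watermark to bound the deduplication state (both are assumptions, not part of the original answer):

import org.apache.spark.sql.functions._

// Assumes `parsed` has the columns brand (string), speed (double) and
// ts (long, nanoseconds) from the earlier parsing sketch; the 10-minute
// watermark is an arbitrary bound on how late a duplicate may arrive.
val deduplicated = parsed
  .withColumn("event_time", (col("ts") / 1000000000L).cast("timestamp"))
  .withWatermark("event_time", "10 minutes")
  // Including the watermarked event-time column in the keys lets Spark
  // eventually purge old deduplication state.
  .dropDuplicates("brand", "speed", "event_time")

// The average is then computed on the deduplicated stream, so a replayed
// record with the same brand, speed and timestamp no longer skews the result.
val averagedAfterDedup = deduplicated
  .groupBy(col("ts"), col("brand"))
  .agg(avg(col("speed")).as("avg_speed"))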

Refer to the Spark documentation here.