I am using Kafka with Spark 2.4.5 Structured Streaming and computing an average, but I am running into problems because duplicate records arrive from the Kafka topic in the current batch.
For example, these Kafka messages are received in the first batch (output mode is update):
car,Brand=Honda,speed=110,1588569015000000000
car,Brand=ford,speed=90,1588569015000000000
car,Brand=Honda,speed=80,15885690150000000000
The result is the average speed per brand per timestamp, i.e. grouping on timestamp 1588569015000000000 and Brand=Honda, we get
(110 + 90) / 2 = 100
Now a second batch of messages arrives as late data, containing a duplicated message with the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000
I expect the average to update to (110 + 90 + 50) / 3 = 83.33,
but the result updates to (110 + 90 + 50 + 50) / 4 = 75, which is wrong.
val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1")          // single input topic
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as data")  // raw Kafka message value as a string column
I group the data by timestamp and Brand and write the result back to Kafka with checkpointing, roughly as in the sketch below. How can I do this correctly with Spark Structured Streaming, or is there an error in my code?
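A minimal sketch of the parsing, aggregation and Kafka sink I mean (illustrative only: the parsing logic, the parsedStream/averaged names, the output topic and the checkpoint path are placeholders, not my exact code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Parse the comma-separated value into Brand, speed and timestamp columns (illustrative parsing).
val parsedStream: DataFrame = rawDataStream
  .select(split(col("data"), ",").as("parts"))
  .select(
    split(col("parts")(1), "=")(1).as("Brand"),
    split(col("parts")(2), "=")(1).cast("double").as("speed"),
    col("parts")(3).as("timestamp"))

// Average speed per timestamp and Brand.
val averaged = parsedStream
  .groupBy(col("timestamp"), col("Brand"))
  .agg(avg(col("speed")).as("avg_speed"))

// Write the running averages back to Kafka in update mode, with checkpointing.
val query = averaged
  .selectExpr("CAST(Brand AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", "output-topic")                       // placeholder output topic
  .option("checkpointLocation", "/tmp/checkpoints/avg")  // placeholder path
  .outputMode("update")
  .start()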
Answer 0 (score: 1)
Spark Structured Streaming lets you deduplicate a streaming DataFrame with dropDuplicates. You specify the columns that identify a duplicate record; across batches, Spark keeps only the first record for each combination and discards later records with the same values.
The snippet below deduplicates the streaming DataFrame on the combination of Brand, speed and timestamp.
rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
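For example, placed before the aggregation in a pipeline like the one sketched in the question (parsedStream, the column names and the aggregation are assumed placeholders, not your actual code):

import org.apache.spark.sql.functions.{avg, col}

// Drop repeated Kafka records before aggregating so they cannot skew the average.
// Note: without a watermark the deduplication state grows without bound; calling
// withWatermark on an event-time column before dropDuplicates lets Spark expire old state.
val deduped = parsedStream
  .dropDuplicates("Brand", "speed", "timestamp")

val averaged = deduped
  .groupBy(col("timestamp"), col("Brand"))
  .agg(avg(col("speed")).as("avg_speed"))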
Refer to the Spark documentation here.