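I am writing Structured Streaming code that subscribes to data from a Kafka queue and writes the raw data back to HBase. Between those two steps, I have to meet the following requirements:
Because of conditions 2 and 3, I cannot directly use the deduplication that Spark provides, since applying a watermark would drop data older than what condition 3 allows.
To work around this, I used arbitrary stateful processing. REF: https://spark.apache.org/docs/latest/api/scala/index.html?_sm_au_=iVV0QDHnqrDVFDRMkpQ8jKtB7ckcW#org.apache.spark.sql.streaming.GroupState
My code is as follows:
Code to read from Kafka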
val kafkaIpStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBroker)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
Code to remove duplicates
val kafkaStream = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
  .withColumn("ts", split($"key", "/")(1))
  .selectExpr("key as rowkey", "ts", "value as val")
  .withColumn("isValid", validationUDF($"rowkey", $"ts", $"val"))
  .as[inputTsRecord]
  .groupByKey(_.rowkey)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout())(updateStateAccrossRecords)
  .toDF("rowkey", "ts", "val", "isValid")
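For illustration, here is what the `split($"key", "/")(1)` step is meant to pull out of the key, written as plain Scala. This is only a sketch: the helper name `extractTs` and the `<id>/<timestamp>` key layout are assumptions made for the example, not something taken from the job itself. What Spark does when the second element is missing depends on configuration, so the sketch makes that case explicit with an `Option`.

```scala
// Hypothetical helper: mirrors split(key, "/")(1) on a plain String.
// The key layout "<id>/<timestamp>" is an assumption for illustration.
def extractTs(key: String): Option[String] = {
  val parts = key.split("/")
  // Index 1 is the second element, matching the (1) used on the Column.
  if (parts.length > 1) Some(parts(1)) else None
}
```

With a key such as `sensor-1/1563872400`, the helper yields the part after the first slash; a key without a slash yields `None`.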
Deduplication function
import org.apache.spark.sql.streaming.GroupState

case class inputTsRecord(rowkey: String, ts: String, `val`: String, isValid: String)
case class state(rowkey: String, `val`: String, insertTimestamp: Long)

def updateStateAccrossRecords(rowKey: String, inputRows: Iterator[inputTsRecord], oldState: GroupState[state]): Iterator[inputTsRecord] = {
  inputRows.toSeq.toIterator.flatMap { iprow =>
    println("received data for " + iprow.rowkey)
    if (oldState.hasTimedOut) {
      println("State timed out")
      oldState.remove()
      Iterator()
    } else if (oldState.exists) {
      println("State exists for " + iprow.rowkey)
      // Whole hours elapsed since the state was stored (insertTimestamp is in epoch seconds)
      val timeDuration = ((System.currentTimeMillis / 1000) - oldState.get.insertTimestamp) / 60 / 60
      println("State not timed out for " + iprow.rowkey)
      println("Duration passed " + timeDuration)
      val updatedState = state(iprow.rowkey, iprow.`val`, System.currentTimeMillis / 1000)
      val isValChanged = updatedState.`val` != oldState.get.`val`
      if (isValChanged) {
        // Value differs from the stored one: emit the row and refresh the state
        println("value changed for " + iprow.rowkey)
        oldState.update(updatedState)
        oldState.setTimeoutDuration("2 hours")
        Iterator(iprow)
      } else {
        // Same value as before: suppress the row; drop the state manually once it is 2 hours old
        if (timeDuration >= 2) {
          println("removing state for " + iprow.rowkey)
          oldState.remove()
        }
        println("value not changed for " + iprow.rowkey)
        Iterator()
      }
    } else {
      // First record seen for this key: store it and emit the row
      println("State does not exist for " + iprow.rowkey)
      val newState = state(iprow.rowkey, iprow.`val`, System.currentTimeMillis / 1000)
      oldState.update(newState)
      oldState.setTimeoutDuration("2 hours")
      Iterator(iprow)
    }
  }
}
The problem now is this part:
if (timeDuration >= 2) {
  println("removing state for " + iprow.rowkey)
  oldState.remove()
}
As I understand it, when I use GroupStateTimeout.ProcessingTimeTimeout() on the stream, a key should expire once the specified processing-time duration has elapsed.
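To make that expectation concrete, below is a small Spark-free model of one invocation of the state function for a single key. It is a sketch under stated assumptions, not the job's actual code: `Rec`, `St`, `DedupModel` and `step` are hypothetical names, and `prev`/the returned `Option[St]` stand in for what `GroupState` would hold. It relies on a point from the GroupState API documentation that is worth verifying for your Spark version: on a timeout the function is called for that key with no input rows and `hasTimedOut` set to true, so the model checks the timeout before looking at any rows.

```scala
// Spark-free model of one call to the state function for one key.
// All names here are hypothetical and exist only for this illustration.
final case class Rec(rowkey: String, value: String)
final case class St(value: String, insertedAtSec: Long)

object DedupModel {
  /** Returns the rows to emit and the state to keep after this invocation. */
  def step(hasTimedOut: Boolean,
           rows: Seq[Rec],
           prev: Option[St],
           nowSec: Long): (Seq[Rec], Option[St]) =
    if (hasTimedOut) {
      // Timeout invocation: there are no rows to iterate, just drop the state.
      (Seq.empty[Rec], None)
    } else {
      rows.foldLeft((Vector.empty[Rec], prev)) { case ((out, st), row) =>
        st match {
          // Same value as the stored one: suppress the duplicate, keep the state.
          case Some(s) if s.value == row.value => (out, st)
          // New key or changed value: emit the row and store the new value.
          case _ => (out :+ row, Some(St(row.value, nowSec)))
        }
      }
    }
}
```

In this model, a timed-out key loses its state, so the next record with the same value is emitted again, which is the behavior the 2-hour expiry is supposed to give. Wiring this into `flatMapGroupsWithState` would still require calling `setTimeoutDuration` on the real `GroupState`; that part is deliberately left out here.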
What am I missing?
Thanks for the help.