Question

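I am writing Structured Streaming code that subscribes to data from a Kafka topic and writes the raw data back to HBase. In between, I have to satisfy the following requirements:

1. Data in the stream must be deduplicated over a 2-hour window: whenever a record with a new key arrives, the key should be kept in memory for 2 hours, and any duplicates within those 2 hours must not be sent to HBase.
2. If a record arrives whose key is already in state but whose value has changed, the updated record should be sent to HBase, and the key should be kept for another 2 hours from that point.
3. There is no fixed bound on how late data may arrive, and any incoming data will fall under one of the conditions above.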
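
The first two rules above can be written as a small pure function, which may help when checking the stateful code further down. This is only a sketch under assumptions: `Stored`, `Decision`, `DedupRules` and `decide` are hypothetical names that are not part of Spark, and it assumes that a record arriving after the 2-hour window counts as new.

```scala
// Hypothetical, Spark-free model of requirements 1 and 2.
final case class Stored(value: String, insertedAtMs: Long)

sealed trait Decision
case object Emit extends Decision      // forward the record to HBase
case object Suppress extends Decision  // duplicate inside the window, drop it

object DedupRules {
  // The 2-hour retention window in milliseconds.
  val WindowMs: Long = 2L * 60 * 60 * 1000

  /** Returns the decision for one record and the state to keep afterwards. */
  def decide(prev: Option[Stored], value: String, nowMs: Long): (Decision, Option[Stored]) =
    prev match {
      // Assumption: once the window has elapsed the key counts as unseen again.
      case Some(s) if nowMs - s.insertedAtMs >= WindowMs =>
        (Emit, Some(Stored(value, nowMs)))
      // Same value inside the window: keep the old state and its clock.
      case Some(s) if s.value == value =>
        (Suppress, prev)
      // First sighting (rule 1) or changed value (rule 2):
      // emit and restart the 2-hour clock.
      case _ =>
        (Emit, Some(Stored(value, nowMs)))
    }
}
```

Keeping the rule separate from the Spark state handling makes it possible to test the decision logic without running a stream.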

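Because of conditions 2 and 3, I cannot directly use the deduplication feature that Spark provides, since applying a watermark would drop data that arrives later than the watermark allows, which conflicts with condition 3.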
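
For comparison, the built-in approach referred to here usually looks like the sketch below. It assumes the stream has a TimestampType event-time column, here called `eventTime` (a hypothetical name; the `ts` column in this question is a string taken from the key). `WatermarkDedup` and `dedupWithinWatermark` are also illustrative names.

```scala
import org.apache.spark.sql.DataFrame

object WatermarkDedup {
  // Assumes `events` has a "rowkey" column and a TimestampType column
  // named "eventTime" (hypothetical, not present in the question's schema).
  def dedupWithinWatermark(events: DataFrame): DataFrame =
    events
      // Records arriving later than this threshold are dropped by the query.
      .withWatermark("eventTime", "2 hours")
      // Deduplication state is only kept up to the watermark.
      .dropDuplicates("rowkey", "eventTime")
}
```

With no bound on lateness, any watermark threshold would discard some records, which is why this route is ruled out.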

So, to solve this, I adopted "arbitrary stateful processing". REF: https://spark.apache.org/docs/latest/api/scala/index.html?_sm_au_=iVV0QDHnqrDVFDRMkpQ8jKtB7ckcW#org.apache.spark.sql.streaming.GroupState

My code is as follows:

Code to read from Kafka

  val kafkaIpStream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", kafkaBroker)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()

Code to remove duplicates

    val kafkaStream = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
      .withColumn("ts", split($"key", "/")(1))
      .selectExpr("key as rowkey", "ts", "value as val")
      .withColumn("isValid", validationUDF($"rowkey", $"ts", $"val"))
      .as[inputTsRecord]
      .groupByKey(_.rowkey)
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout())(updateStateAccrossRecords)
      .toDF("rowkey", "ts", "val", "isValid")

Deduplication function

  import org.apache.spark.sql.streaming.GroupState

  // One record as produced by the selectExpr/withColumn chain above.
  case class inputTsRecord(rowkey: String, ts: String, `val`: String, isValid: String)
  // Per-key state: the last value seen and when it was stored (epoch seconds).
  case class state(rowkey: String, `val`: String, insertTimestamp: Long)

  def updateStateAccrossRecords(rowKey: String, inputRows: Iterator[inputTsRecord], oldState: GroupState[state]): Iterator[inputTsRecord] = {

    // The body below runs once per input row of this key.
    inputRows.toSeq.toIterator.flatMap { iprow =>

      println("received data for " + iprow.rowkey)

      if (oldState.hasTimedOut) {
        println("State timed out")
        oldState.remove()
        Iterator()
      } else if (oldState.exists) {
        println("State exists for " + iprow.rowkey)

        // Whole hours elapsed since the state was last written.
        val timeDuration = (((System.currentTimeMillis / 1000) - oldState.get.insertTimestamp) / 60) / 60

        println("State not timed out for " + iprow.rowkey)
        println("Duration passed " + timeDuration)

        val updatedState = state(iprow.rowkey, iprow.`val`, System.currentTimeMillis / 1000)
        val isValChanged = updatedState.`val` != oldState.get.`val`

        if (isValChanged) {
          // Changed value: store it, restart the 2-hour timeout, emit the record.
          println("value changed for " + iprow.rowkey)
          oldState.update(updatedState)
          oldState.setTimeoutDuration("2 hours")
          Iterator(iprow)
        } else {
          // Duplicate: emit nothing; drop the state if it is at least 2 hours old.
          if (timeDuration >= 2) {
            println("removing state for " + iprow.rowkey)
            oldState.remove()
          }
          println("value not changed for " + iprow.rowkey)
          Iterator()
        }
      } else {
        // First record for this key: create state, start the timeout, emit.
        println("State does not exists for " + iprow.rowkey)

        val newState = state(iprow.rowkey, iprow.`val`, System.currentTimeMillis / 1000)
        oldState.update(newState)
        oldState.setTimeoutDuration("2 hours")

        Iterator(iprow)
      }
    }
  }

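Now the problem is this:

Even though the timeout is specified as a 2-hour processing-time timeout, the keys do not expire after the specified time (checked through the logs).
The keys are unique, i.e. ideally a key has only one entry for the entire application lifetime unless a duplicate is present.
Because of this, the state ends up containing all the keys as the stream progresses, which leads to memory problems.
A key expires only when it receives data after its 2 hours have passed, because of the following code: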

    if (timeDuration >= 2) {
      println("removing state for " + iprow.rowkey)
      oldState.remove()
    }

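The stream is receiving data continuously.

As I understand it, since I am using GroupStateTimeout.ProcessingTimeTimeout() on the stream, a key should expire once the specified processing time has elapsed.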
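
One point from the GroupState API documentation may be relevant here: when a timeout fires, the function is called for that group with no values and `hasTimedOut` set to true, and the timeout has to be set again on every call. Under that reading, a `hasTimedOut` check placed inside a loop over the input rows would never run on a timeout call, because there are no rows. The sketch below hoists the check out of the loop. `InputRec`, `KeyState` and `TimeoutAwareDedup` are illustrative names, timestamps are kept in milliseconds, and the query is assumed to use `GroupStateTimeout.ProcessingTimeTimeout()`.

```scala
import org.apache.spark.sql.streaming.GroupState

// Illustrative copies of the record and state shapes used in this question.
case class InputRec(rowkey: String, ts: String, `val`: String, isValid: String)
case class KeyState(rowkey: String, `val`: String, insertedAtMs: Long)

object TimeoutAwareDedup {
  // The 2-hour retention window in milliseconds.
  val WindowMs: Long = 2L * 60 * 60 * 1000

  def update(
      rowKey: String,
      rows: Iterator[InputRec],
      state: GroupState[KeyState]): Iterator[InputRec] = {
    if (state.hasTimedOut) {
      // Timeout call: `rows` is empty, so this cannot live inside a per-row loop.
      state.remove()
      Iterator.empty
    } else {
      val now = System.currentTimeMillis
      // Materialize first so all state changes happen before the function returns.
      val emitted = rows.toList.filter { row =>
        val changed = !state.exists || state.get.`val` != row.`val`
        if (changed) state.update(KeyState(rowKey, row.`val`, now))
        changed
      }
      // Re-arm the timeout on every call, using the time left in the window
      // so that duplicates do not extend the retention period.
      if (state.exists) {
        val remainingMs = WindowMs - (now - state.get.insertedAtMs)
        if (remainingMs > 0) state.setTimeoutDuration(remainingMs) else state.remove()
      }
      emitted.iterator
    }
  }
}
```

A function of this shape would be passed to `flatMapGroupsWithState` in place of the original one. The same documentation notes that processing-time timeouts have no strict upper bound and are only evaluated when a trigger runs, so expiry can lag the 2-hour mark.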

What am I missing?

Thanks for the help.

Keys do not expire in Spark Structured Streaming after "GroupStateTimeout.ProcessingTimeTimeout()" completes
