flatMapGroupsWithState

Asked: 2019-07-10 15:30:59

Tags: scala apache-spark spark-streaming state spark-structured-streaming

hasTimedOut is never true in my arbitrary stateful processing function updateState:

def updateWithEvent(tagCount: TagCount, inputsSize: Int): TagCount = {
  TagCount(tagCount.tag, tagCount.count + inputsSize)
}

def updateState(
                 tag: String,
                 inputs: Iterator[ExtendedTweet],
                 oldState: GroupState[TagCount]
               ): Iterator[TagsStatus] = {

  val state = oldState.getOption.getOrElse(TagCount(tag, 0))
  val rows = inputs.toSeq.sortBy(_.createdAt.getTime)
  val newState = updateWithEvent(state, rows.length)

  rows.toIterator.flatMap { input =>
    if (oldState.hasTimedOut) {
      println("hasTimedOut is never true")
      oldState.remove()
      Iterator(TagsStatus(input.createdAt, input.post, input.author, input.tag, 1))
    } else {
      oldState.update(newState)
      oldState.setTimeoutTimestamp(input.createdAt.getTime, "30 seconds")
      Iterator(TagsStatus(input.createdAt, input.post, input.author, newState.tag, newState.count))
    }
  }
}

Invocation:

val memorySink = tweetsStream
  .withWatermark("createdAt", "30 seconds")
  .groupByKey(_.tag)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.EventTimeTimeout())(updateState)
  .writeStream
  .format("memory")
  .outputMode("update")
  .queryName("tweets_stream")
  .start()
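
As a side note, the memory sink registers the streaming output as an in-memory table under the query name, so the table below can be read back with plain SQL (a minimal sketch, assuming a SparkSession named spark):

spark.sql("SELECT * FROM tweets_stream").show(truncate = false)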

With these parameters, I get the following output:

+-------------------+--------------------+-------------------+-------+---------+
|          createdAt|                post|             author|    tag|tagsCount|
+-------------------+--------------------+-------------------+-------+---------+
|2019-07-10 18:20:22|Lots of great con...|        Seth Martin|bigdata|        1|
|2019-07-10 18:20:29|DF > Machine lear...|  Mr Data Scientist|bigdata|        2|
|2019-07-10 18:20:31|Samsung Galaxy Wa...|              NoSQL|bigdata|        3|
|2019-07-10 18:20:42|Sunday Briefing #...|  Mr Data Scientist|bigdata|        4|
|2019-07-10 18:20:42|CCA131 Demystify ...|              niken|bigdata|        6|
|2019-07-10 18:20:44|Setting alerts in...|              NoSQL|bigdata|        6|
|2019-07-10 18:20:47|CCA131 Demystify ...|   Javascript30 Bot|bigdata|       11|
|2019-07-10 18:20:47|CCA131 Demystify ...|100 Days Of ML Code|bigdata|       11|
|2019-07-10 18:20:47|CCA131 Demystify ...|     BFTawfik Bot 2|bigdata|       11|
|2019-07-10 18:20:47|CCA131 Demystify ...|        CodersNotes|bigdata|       11|
|2019-07-10 18:20:47|CCA131 Demystify ...|           Adrinbot|bigdata|       11|
+-------------------+--------------------+-------------------+-------+---------+

I expect the old state to be removed when the timeout expires, and the tag count to start again from 1 rather than continuing from the previous state's count.

1 Answer:

Answer 0 (score: 0)

The problem is that your if (oldState.hasTimedOut) check is inside the flatMap. It must be outside the flatMap. Try changing your code along the lines of the example below:

  import java.sql.Timestamp

  def updateState(
      user: String,
      inputs: Iterator[InputRecord1],
      oldState: GroupState[SessionState]
  ): Iterator[OutputRecord1] = {

    var output: Iterator[OutputRecord1] = Iterator()

    if (oldState.hasTimedOut) {
      // Timeout invocation: Spark calls this function with an empty `inputs`
      // iterator, so the timeout must be handled before consuming any input.
      val state = oldState.get
      oldState.remove()
      output = Iterator(OutputRecord1(user, state.numSum, state.last_event_time))
    } else {
      for (input <- inputs) {
        // updateSessionState is my own fold/update helper (not shown here).
        val base =
          if (oldState.exists) oldState.get
          else SessionState(user, 0, new Timestamp(0L))
        val newState = updateSessionState(base, input)
        oldState.update(newState)
        oldState.setTimeoutTimestamp(newState.last_event_time.getTime, "5 seconds")
      }
    }

    output
  }

I once had the same problem as you, and it was not solved until I changed my code this way. I believe that when a state times out, Spark calls updateState once with no records, i.e., with an empty input Iterator. Because the Iterator is empty, the code inside the flatMap never executes, and since your if (oldState.hasTimedOut) check lives inside that flatMap, it never runs.
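
For illustration, here is a minimal, untested sketch of the asker's updateState with the hasTimedOut check moved outside the per-row flatMap, reusing the TagCount, ExtendedTweet, TagsStatus and updateWithEvent definitions from the question and the same 30-second event-time timeout:

def updateState(
    tag: String,
    inputs: Iterator[ExtendedTweet],
    oldState: GroupState[TagCount]
): Iterator[TagsStatus] = {
  if (oldState.hasTimedOut) {
    // Timeout invocation: `inputs` is empty here, so this branch must not
    // depend on consuming any rows.
    oldState.remove() // drop the expired count; the next batch restarts at 1
    Iterator.empty
  } else {
    val state = oldState.getOption.getOrElse(TagCount(tag, 0))
    val rows = inputs.toSeq.sortBy(_.createdAt.getTime)
    val newState = updateWithEvent(state, rows.length)
    oldState.update(newState)
    // Re-arm the timeout 30 seconds past the newest event in this batch.
    rows.lastOption.foreach(r => oldState.setTimeoutTimestamp(r.createdAt.getTime, "30 seconds"))
    rows.iterator.map { input =>
      TagsStatus(input.createdAt, input.post, input.author, newState.tag, newState.count)
    }
  }
}

Note that with GroupStateTimeout.EventTimeTimeout the timeout only fires once the watermark advances past the timestamp set by setTimeoutTimestamp, which in turn requires new data to arrive and move the watermark forward.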