hasTimedOut is never true in my arbitrary state-handling function updateState:
def updateWithEvent(tagCount: TagCount, inputsSize: Int): TagCount = {
  TagCount(tagCount.tag, tagCount.count + inputsSize)
}

def updateState(
    tag: String,
    inputs: Iterator[ExtendedTweet],
    oldState: GroupState[TagCount]
): Iterator[TagsStatus] = {
  val state = oldState.getOption.getOrElse(TagCount(tag, 0))
  val rows = inputs.toSeq.sortBy(_.createdAt.getTime)
  val newState = updateWithEvent(state, rows.length)
  rows.toIterator.flatMap { input =>
    if (oldState.hasTimedOut) {
      println("hasTimedOut is never true")
      oldState.remove()
      Iterator(TagsStatus(input.createdAt, input.post, input.author, input.tag, 1))
    } else {
      oldState.update(newState)
      oldState.setTimeoutTimestamp(input.createdAt.getTime, "30 seconds")
      Iterator(TagsStatus(input.createdAt, input.post, input.author, newState.tag, newState.count))
    }
  }
}
It is called from:
val memorySink = tweetsStream
  .withWatermark("createdAt", "30 seconds")
  .groupByKey(_.tag)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.EventTimeTimeout())(updateState)
  .writeStream
  .format("memory")
  .outputMode("update")
  .queryName("tweets_stream")
  .start()
With these parameters, I get the following output:
+-------------------+--------------------+-------------------+-------+---------+
| createdAt| post| author| tag|tagsCount|
+-------------------+--------------------+-------------------+-------+---------+
|2019-07-10 18:20:22|Lots of great con...| Seth Martin|bigdata| 1|
|2019-07-10 18:20:29|DF > Machine lear...| Mr Data Scientist|bigdata| 2|
|2019-07-10 18:20:31|Samsung Galaxy Wa...| NoSQL|bigdata| 3|
|2019-07-10 18:20:42|Sunday Briefing #...| Mr Data Scientist|bigdata| 4|
|2019-07-10 18:20:42|CCA131 Demystify ...| niken|bigdata| 6|
|2019-07-10 18:20:44|Setting alerts in...| NoSQL|bigdata| 6|
|2019-07-10 18:20:47|CCA131 Demystify ...| Javascript30 Bot|bigdata| 11|
|2019-07-10 18:20:47|CCA131 Demystify ...|100 Days Of ML Code|bigdata| 11|
|2019-07-10 18:20:47|CCA131 Demystify ...| BFTawfik Bot 2|bigdata| 11|
|2019-07-10 18:20:47|CCA131 Demystify ...| CodersNotes|bigdata| 11|
|2019-07-10 18:20:47|CCA131 Demystify ...| Adrinbot|bigdata| 11|
+-------------------+--------------------+-------------------+-------+---------+
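I expect the old state to be removed when the window timeout expires, and the tag count to start again from 1 rather than continue from the previous state's count.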
Answer 0 (score: 0)
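The problem is that your "if (oldState.hasTimedOut)" statement is wrapped inside the flatMap. It must be outside the flatMap. Try changing your code along the lines of the example below: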
def updateState(
    user: String,
    inputs: Iterator[InputRecord1],
    oldState: GroupState[SessionState]
): Iterator[OutputRecord1] = {
  var output: Iterator[OutputRecord1] = Iterator()
  if (oldState.hasTimedOut) {
    val state = oldState.get
    oldState.remove()
    output = Iterator(OutputRecord1(user, state.numSum, state.last_event_time))
  } else {
    for (input <- inputs) {
      if (!oldState.exists) {
        val newState = updateSessionState(SessionState(user, 0, new Timestamp(0L)), input)
        oldState.update(newState)
        oldState.setTimeoutTimestamp(newState.last_event_time.getTime, "5 seconds")
      } else {
        val newState = updateSessionState(oldState.get, input)
        oldState.update(newState)
        oldState.setTimeoutTimestamp(newState.last_event_time.getTime, "5 seconds")
      }
    }
  }
  output
}
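I ran into the same problem as you, and it was only resolved once I changed the code this way. My guess is that when the state times out, Spark calls updateState once without passing any records, so the input Iterator is empty. Because the Iterator is empty, the code inside flatMap is not executed, and since the "if (oldState.hasTimedOut)" statement sits inside the flatMap, it never runs.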
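The effect can be reproduced without Spark. Below is a minimal sketch in plain Scala, written as a script or REPL snippet. The names checkInsideFlatMap and checkOutsideFlatMap are hypothetical, the timedOut flag stands in for oldState.hasTimedOut, and the strings stand in for output rows. It only illustrates how an empty Iterator behaves and is not the actual Spark code path.

```scala
// Hypothetical stand-ins: `timedOut` plays the role of oldState.hasTimedOut,
// and plain strings play the role of the output rows.

// Timeout check nested inside flatMap, as in the question.
// With an empty iterator the function body is never evaluated.
def checkInsideFlatMap(inputs: Iterator[String], timedOut: Boolean): List[String] =
  inputs.flatMap { input =>
    if (timedOut) Iterator("timeout") else Iterator(s"row:$input")
  }.toList

// Timeout check hoisted out of the per-row loop, as in the answer.
// The check runs even when there are no input rows.
def checkOutsideFlatMap(inputs: Iterator[String], timedOut: Boolean): List[String] =
  if (timedOut) List("timeout")
  else inputs.map(input => s"row:$input").toList

// Simulate a timeout invocation: no input rows, timedOut = true.
val nested = checkInsideFlatMap(Iterator.empty, timedOut = true)
val hoisted = checkOutsideFlatMap(Iterator.empty, timedOut = true)
```

Here nested ends up empty because the timeout branch is never reached, while hoisted contains the single timeout row. That matches the behaviour described above: the timeout check has to be made before iterating over the inputs.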