When using flatMapGroupsWithState in append mode with a watermark, when is data written to the sink? According to the documentation:
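"Because of the mode's semantics, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(); rows can be added to the result table only once, after they are finalized (i.e., after the watermark is crossed)."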
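The finalization rule in that passage can be modeled in plain Scala, without Spark. This is only a rough sketch under stated assumptions: the 5-minute tumbling window, the names `Event`, `windowEnd` and `finalizedPerBatch`, and the choice to apply the new watermark within the same batch are all hypothetical simplifications (Spark itself applies an updated watermark in the following micro-batch).

```scala
object WatermarkSketch {
  // Hypothetical event type: a key plus an event time in seconds
  final case class Event(key: String, timeSec: Long)

  val delaySec = 60L   // mirrors withWatermark("time", "1 minute")
  val windowSec = 300L // assumed 5-minute tumbling window, not from the question

  // End (exclusive) of the tumbling window that contains timeSec
  def windowEnd(timeSec: Long): Long = (timeSec / windowSec + 1) * windowSec

  // For each batch, returns the window ends that become final after that batch,
  // i.e. the only windows an append-mode aggregation would be able to emit.
  def finalizedPerBatch(batches: Seq[Seq[Event]]): Seq[Seq[Long]] = {
    var watermark = 0L
    var openWindows = Set.empty[Long]
    batches.map { batch =>
      // Drop events whose window the watermark has already passed
      val onTime = batch.filter(e => windowEnd(e.timeSec) > watermark)
      openWindows ++= onTime.map(e => windowEnd(e.timeSec))
      // Watermark = max event time seen so far minus the allowed delay
      if (onTime.nonEmpty)
        watermark = math.max(watermark, onTime.map(_.timeSec).max - delaySec)
      // A window is final once the watermark reaches its end
      val (finalized, stillOpen) = openWindows.partition(_ <= watermark)
      openWindows = stillOpen
      finalized.toSeq.sorted
    }
  }
}
```

In this model, a window ending at 300 seconds is not emitted while the latest event time is 320 (watermark 260); it is emitted only once an event at 360 or later arrives.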
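So with flatMapGroupsWithState in append mode, should I return data only after the group state has timed out (i.e., after the watermark has been crossed)? Here is a code example of what I mean.
Option 1: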
dataset.withWatermark("time", "1 minute")
  .groupByKey(row => row.key)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout())(mapFunc)

// Assumes State is a collection of Rows, since state.get.iterator is returned as the output
def mapFunc(key: Int, data: Iterator[Rows], state: GroupState[State]): Iterator[Rows] = {
  var results: Iterator[Rows] = Iterator.empty
  if (state.hasTimedOut) {
    // Emit the accumulated state only when the timeout fires, then drop it
    results = state.get.iterator
    state.remove()
  } else {
    updateState(key, data, state)
  }
  results
}
Option 2:
dataset.withWatermark("time", "1 minute")
  .groupByKey(row => row.key)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout())(mapFunc)

// Assumes State is a collection of Rows, since state.get.iterator is returned as the output
def mapFunc(key: Int, data: Iterator[Rows], state: GroupState[State]): Iterator[Rows] = {
  var results: Iterator[Rows] = Iterator.empty
  if (state.hasTimedOut) {
    results = state.get.iterator
    state.remove()
  } else {
    updateState(key, data, state)
    // Also emit the current state on every trigger
    results = state.get.iterator
  }
  results
}
In option 1, I return results only after the GroupState has timed out; in option 2, I emit results on every trigger. With append output mode, how do these two differ?
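To make the comparison concrete, the two options can be modeled in plain Scala without Spark. This is a sketch, not Spark code, and it rests on one assumption: that whatever the mapping function returns is appended to the sink as is, and that an append sink cannot retract earlier rows. The names `Call`, `run` and `emitOnUpdate` are hypothetical, and the state is reduced to a vector of integers for a single key.

```scala
object AppendOptionsSketch {
  // One invocation of the mapping function for a single key.
  // timedOut = true models the call made when the event-time timeout fires.
  final case class Call(values: Seq[Int], timedOut: Boolean)

  // Returns the rows that would be appended to the sink for each call.
  // emitOnUpdate = false models option 1, emitOnUpdate = true models option 2.
  def run(calls: Seq[Call], emitOnUpdate: Boolean): Seq[Seq[Int]] = {
    var state: Option[Vector[Int]] = None
    calls.map { call =>
      if (call.timedOut) {
        val out = state.getOrElse(Vector.empty)
        state = None // stands in for state.remove()
        out
      } else {
        // stands in for updateState(key, data, state)
        state = Some(state.getOrElse(Vector.empty) ++ call.values)
        if (emitOnUpdate) state.get else Vector.empty
      }
    }
  }
}
```

With two data batches carrying 1 and then 2, followed by a timeout, the option 1 model appends 1, 2 exactly once, at the timeout. The option 2 model appends 1, then 1, 2, then 1, 2 again at the timeout, so under the assumption above the sink would hold intermediate and repeated rows.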