在带有水印的Append模式下使用flatMapGroupWithState进行结构化流式传输

时间:2020-09-16 06:54:49

标签: apache-spark spark-streaming spark-structured-streaming

在带有水印的附加模式中使用flatMapGroupWithState时,何时将数据写入接收器?按照documentation

由于模式语义,窗口聚合的输出被withWatermark()中指定的延迟阈值延迟,行完成后(即,越过水印之后)只能将行添加到结果表中。

因此在具有附加模式的flatMapGroupWithState中,我是否仅在组状态超时后(即,在水印被越过之后)返回数据?我的意思的代码示例-

方案1-

dataset.withWatermark("time", "1 minute")
       .groupByKey(row => (row.key)
       .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout())(mapFunc)

def mapFunc(key: Int, data: Iterator[Rows], state: GroupState[State]): Iterator = {
  var results = Iterator.Empty
  if (state.hasTimedOut) {
    results = state.get.iterator
    state.remove()
  } else {
    updateState(key, data, state)
  }
  results
}

方案2-

dataset.withWatermark("time", "1 minute")
       .groupByKey(row => (row.key)
       .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout())(mapFunc)

def mapFunc(key: Int, data: Iterator[Rows], state: GroupState[State]): Iterator = {
  var results = Iterator.Empty
  if (state.hasTimedOut) {
    results = state.get.iterator
    state.remove()
  } else {
    updateState(key, data, state)
    results = state.get.iterator
  }
  results
}

在方案1中,我仅在GroupState超时后才返回结果;在方案2中,我在每个触发器中都发出结果。如果使用附加输出模式,这两个有何不同?

0 个答案:

没有答案