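We have a Spark application with two aggregation steps followed by an inner join.
The aggregation steps use a function that emits data only when the state times out: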
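import org.apache.spark.sql.streaming.GroupState

def getSession[A, B](sessionId: String,
                     inputs: Iterator[A],
                     oldState: GroupState[A]): Iterator[B] = {
  if (oldState.hasTimedOut) {
    // The session has expired: clear the state and emit the final result.
    oldState.remove()
    val finalState: B = ??? // get this when the event expires
    Iterator(finalState)
  } else {
    // The session is still open: update the state and push the timeout forward.
    val aggrState: A = ???
    oldState.update(aggrState)
    val latestTimestamp: Long = ???
    oldState.setTimeoutTimestamp(latestTimestamp, "5 seconds")
    Iterator.empty // nothing is emitted until the timeout fires
  }
}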
This function is used on two streams:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

val inputStreamOne: Dataset[A] = ???
val inputStreamTwo: Dataset[A] = ???

val aggrInputStreamOneDF = {
  inputStreamOne
    .withWatermark("eventTimestamp", "2 minutes")
    .groupByKey(_.sessionId)
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout)(getSession)
}

// Renamed from a second `aggrInputStreamOneDF`, which would not compile.
val aggrInputStreamTwoDF = {
  inputStreamTwo
    .withWatermark("eventTimestamp", "2 minutes")
    .groupByKey(_.sessionId)
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout)(getSession)
}
These streams are then joined:
val joinedStream = {
  inputStreamOne.withWatermark("inputStreamOneEventTimestamp", "4 minutes").join(
    inputStreamTwo.withWatermark("inputStreamTwoEventTimestamp", "4 minutes"),
    functions.expr(
      """
        |inputStreamOneSessionId = inputStreamTwoSessionId AND
        |inputStreamOneEventTimestamp >= inputStreamTwoEventTimestamp AND
        |inputStreamOneEventTimestamp <= inputStreamTwoEventTimestamp + interval 10 seconds
        |""".stripMargin),
    "inner")
}
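The problem we found is that, because Spark cannot set multiple watermarks (in this case the watermark ends up at 2 minutes for every step), by the time the first step emits its data the watermark on the join has already expired as well, since the first step only emits data once the watermark expires.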
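To make the timing concrete, here is a minimal sketch of how we understand the problem. It is plain Scala with no Spark dependency, and it is only a simplified model, not Spark's actual implementation: it assumes a single global watermark equal to the maximum event time seen minus the 2-minute delay, a session that times out once the watermark passes its latest timestamp plus 5 seconds, and a downstream operator that treats any row behind the watermark as late. The object and function names (`WatermarkChainSketch`, `globalWatermark`, `hasTimedOut`, `isLateDownstream`) are made up for illustration.

```scala
object WatermarkChainSketch {
  // Assumed values, taken from the snippets above.
  val watermarkDelayMs: Long = 2 * 60 * 1000L // "2 minutes" in withWatermark
  val timeoutGapMs: Long = 5 * 1000L          // "5 seconds" in setTimeoutTimestamp

  // Simplified model: one global watermark trailing the max event time seen.
  def globalWatermark(maxEventTimeMs: Long): Long =
    maxEventTimeMs - watermarkDelayMs

  // The first step emits only once the watermark passes the timeout timestamp.
  def hasTimedOut(latestTimestampMs: Long, watermarkMs: Long): Boolean =
    watermarkMs > latestTimestampMs + timeoutGapMs

  // Assumption: the downstream join treats rows behind the watermark as late.
  def isLateDownstream(eventTimeMs: Long, watermarkMs: Long): Boolean =
    eventTimeMs < watermarkMs

  def main(args: Array[String]): Unit = {
    val sessionLatestTs = 1000000L
    // Earliest max event time at which the session above can time out.
    val maxEventTime = sessionLatestTs + timeoutGapMs + watermarkDelayMs + 1
    val wm = globalWatermark(maxEventTime)

    println(s"session timed out:      ${hasTimedOut(sessionLatestTs, wm)}")
    println(s"emitted row is late:    ${isLateDownstream(sessionLatestTs, wm)}")
  }
}
```

Under these assumptions, any row the first step emits carries an event time that is already behind the same watermark that triggered the timeout, which matches what we observe at the join.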
When using arbitrary stateful operations, is it possible to chain queries that have multiple watermarks?