I am a 10-day-old Spark developer trying to understand Spark's flatMapGroupsWithState API.
As I understand it:

GroupStateTimeout.ProcessingTimeTimeout is an instruction to time out state based on processing time rather than event time. Another argument is the output mode. myFunction is the user-defined function responsible for maintaining the state of each key. We also set a timeout with groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4)), assuming groupState is the GroupState instance for my key. As I understand it, as micro-batches of streaming data keep coming in, Spark maintains the intermediate state that we define in the user-defined function. Suppose the intermediate state after processing n micro-batches of data is as follows:
State for key1:

{
    key1: [v1, v2, v3, v4, v5]
}

State for key2:

{
    key2: [v11, v12, v13, v14, v15]
}
For any new incoming data, myFunction is called with the state of the particular key. For example, for key1, myFunction is called with key1, the new key1 values, and the existing state [v1, v2, v3, v4, v5], and it updates the key1 state according to its logic.
I read up on timeouts and found that the timeout dictates how long we should wait before timing out some intermediate state.
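For concreteness, here is a minimal sketch of the setup I am describing (the Event/KeyState/Output case classes and the rate source are made-up placeholders for illustration, not my actual job):

import java.util.concurrent.TimeUnit
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical input, state and output types, just for illustration.
case class Event(key: String, value: String)
case class KeyState(values: List[String])
case class Output(key: String, values: List[String])

// Appends the new values to the per-key state and (re)sets the 4-hour
// processing-time timeout.
def myFunction(key: String, events: Iterator[Event],
               groupState: GroupState[KeyState]): Iterator[Output] = {
  val updated = KeyState(
    groupState.getOption.getOrElse(KeyState(Nil)).values ++ events.map(_.value))
  groupState.update(updated)
  groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4))
  Iterator(Output(key, updated.values))
}

val spark = SparkSession.builder.appName("state-demo").master("local[*]").getOrCreate()
import spark.implicits._

// The built-in rate source stands in for the real input stream.
val events: Dataset[Event] = spark.readStream.format("rate").load()
  .select($"value").as[Long]
  .map(v => Event(s"key${v % 2}", v.toString))

val stateful = events
  .groupByKey(_.key)
  .flatMapGroupsWithState(
    OutputMode.Update,                       // output mode
    GroupStateTimeout.ProcessingTimeTimeout  // processing-time based timeouts
  )(myFunction _)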
Questions:
Answer 0 (score: 1):
Q: If the process runs indefinitely, my intermediate state will keep piling up and eventually hit the memory limit on the nodes. So when are these intermediate states cleared? I found that in the case of event-time aggregation, the watermark dictates when the intermediate state is cleared.
A: Apache Spark will mark them as expired after the expiration time, so in your example after 4 hours of inactivity (real time + 4 hours, where inactivity means no new event updating the state).
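Two details worth adding here (the function below is my sketch reusing the types assumed in the question's snippet, not the author's actual code). First, per the GroupState scaladoc, the timeout registered with setTimeoutDuration is cleared every time the function is invoked for a key, so it has to be re-set on each call; in effect the 4-hour countdown restarts whenever the key receives new data. Second, when the timeout finally fires, Spark calls the function one more time with an empty values iterator and hasTimedOut == true:

// Imports and types as in the sketch from the question.
def myFunction(key: String, events: Iterator[Event],
               groupState: GroupState[KeyState]): Iterator[Output] = {
  if (groupState.hasTimedOut) {
    // ~4 hours (processing time) elapsed since the last call for this key;
    // `events` is empty here. Emit the expired state; what happens to the
    // stored row itself is covered at the end of this answer.
    groupState.getOption.map(s => Output(key, s.values)).iterator
  } else {
    val updated = KeyState(
      groupState.getOption.getOrElse(KeyState(Nil)).values ++ events.map(_.value))
    groupState.update(updated)
    // The timeout is cleared on every invocation, so re-arm it each time.
    groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4))
    Iterator(Output(key, updated.values))
  }
}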
Q: What does timing out intermediate state mean in the context of processing time?
A: It means that state times out according to the real clock (processing time, the org.apache.spark.util.SystemClock class). You can check which clock is currently used by analyzing the triggerClock parameter of org.apache.spark.sql.streaming.StreamingQueryManager#startQuery.
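If you only want to confirm which timeout configuration your query ends up with, you do not have to step through the internals: the physical plan printed by StreamingQuery#explain contains the FlatMapGroupsWithStateExec node together with its timeout conf. A sketch, assuming `stateful` is the Dataset produced by flatMapGroupsWithState above (the sink settings are illustrative):

val query = stateful.writeStream
  .format("console")
  .outputMode("update")
  .start()

// After a batch has run, the printed plan contains a node like:
// FlatMapGroupsWithStateExec ..., ProcessingTimeTimeout, ...
query.explain()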
You will find more details in the FlatMapGroupsWithStateExec class, especially here:
// Generate a iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
  processor.processNewData(filteredIter) ++ processor.processTimedOutState()
If you analyze these two methods, you will see that:

processNewData applies the mapping function to all active keys (those present in the micro-batch):

/**
 * For every group, get the key, values and corresponding state and call the function,
 * and return an iterator of rows
 */
def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
  val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
  groupedIter.flatMap { case (keyRow, valueRowIter) =>
    val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
    callFunctionAndUpdateState(
      stateManager.getState(store, keyUnsafeRow),
      valueRowIter,
      hasTimedOut = false)
  }
}
processTimedOutState calls the mapping function on all timed-out states:

def processTimedOutState(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutPairs = stateManager.getAllState(store).filter { state =>
      state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
    }
    timingOutPairs.flatMap { stateData =>
      callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
    }
  } else Iterator.empty
}
The important thing to notice here is that Apache Spark will keep the expired state in the state store if you do not invoke the GroupState#remove method. The expired states will not be returned for processing again, since they are flagged with NO_TIMESTAMP. However, they will be stored in the state store delta files, which may slow down reprocessing if you need to reload the most recent state. If you analyze FlatMapGroupsWithStateExec once more, you will see that the state is removed only when its removed flag is set to true:
def callFunctionAndUpdateState(...)
  // ...
  // When the iterator is consumed, then write changes to state
  def onIteratorCompletion: Unit = {
    if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
      stateManager.removeState(store, stateData.keyRow)
      numUpdatedStateRows += 1
    } else {
      val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
      val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
      val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged

      if (shouldWriteState) {
        val updatedStateObj = if (groupState.exists) groupState.get else null
        stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
        numUpdatedStateRows += 1
      }
    }
  }
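To make the first branch above fire, i.e. to have the row physically deleted from the state store, the timed-out invocation of the user function should call remove() and not register a new timeout. A sketch of such a branch (my wording, with the types assumed earlier in this thread):

if (groupState.hasTimedOut) {
  // remove() sets hasRemoved = true; since no new timeout is registered in
  // this invocation, getTimeoutTimestamp stays NO_TIMESTAMP, which is exactly
  // the condition under which stateManager.removeState(...) deletes the row.
  groupState.remove()
  Iterator.empty
}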