Question

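I'm trying to build a sessionization application with Spark Structured Streaming (version 2.2.0).

When using mapGroupsWithState in Update mode, I understand that executors will crash with an OOM exception if the state data grows large. So I have to manage memory with the GroupStateTimeout option. (See: How does Spark Structured Streaming handle in-memory state when state data is growing?)

However, I cannot check whether a state has timed out and is ready to be removed if no more new streaming data arrives for that particular key.

For example, let's say I have the following code.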

myDataset
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)

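The makeSession() function checks whether the state has timed out and removes the timed-out state.

Now, let's say the key "foo" already has some state stored in memory, and no new data with key "foo" is flowing into the application. As a result, makeSession() is never invoked for key "foo", and its stored state is never checked. This means the stored state for key "foo" stays in memory. If there are many keys like "foo", their stored state is never flushed and the JVM raises an OOM exception.

I might be misunderstanding mapGroupsWithState, but I suspect my OOM exception is caused by the issue above.

If I'm right, what is the solution for this case? I want to flush all stored state that has timed out and for which no new streaming data will arrive.

Are there any good code examples?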

Answer 1

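"Now, let's say the key "foo" already has some state stored in memory, and no new data with key "foo" is flowing into the application. As a result, makeSession() is never invoked for key "foo", and its stored state is never checked."

This is not true. As long as there is new data for any key, Spark makes sure that each batch validates the entire key set and invokes the function one last time for the keys that have timed out.

As part of each call to flat/mapGroupsWithState, we have: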

val outputIterator = updater.updateStateForKeysWithData(filteredIter) ++ updater.updateStateForTimedOutKeys()
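To make the two phases in that line concrete, here is a toy model in plain Scala. It is not Spark's implementation: ToyStateStore, SessionState, and processBatch are hypothetical names, and the model only describes what happens once a batch actually runs.

```scala
// Toy model of one micro-batch; NOT Spark's implementation.
// ToyStateStore, SessionState and processBatch are hypothetical names.
import scala.collection.mutable

final case class SessionState(eventCount: Int, timeoutTimestampMs: Long)

object ToyStateStore {
  // Returns the output rows of one batch and mutates `store` in place.
  def processBatch(
      store: mutable.Map[String, SessionState],
      batch: Seq[(String, Long)], // (key, event time in ms)
      watermarkMs: Long,
      gapMs: Long): Seq[String] = {
    // Phase 1: only keys that received data in this batch.
    val updated = batch.groupBy(_._1).toSeq.sortBy(_._1).map { case (key, events) =>
      val previousCount = store.get(key).map(_.eventCount).getOrElse(0)
      val latestEventMs = events.map(_._2).max
      store(key) = SessionState(previousCount + events.size, latestEventMs + gapMs)
      s"updated:$key"
    }
    // Phase 2: every stored key whose timeout is behind the watermark,
    // whether or not it received data in this batch.
    val timedOutKeys = store.collect {
      case (key, state) if state.timeoutTimestampMs < watermarkMs => key
    }.toSeq.sorted
    timedOutKeys.foreach(key => store.remove(key))
    updated ++ timedOutKeys.map(key => s"timedOut:$key")
  }
}
```

With a store that holds only "foo" (timeout timestamp 1000), a batch containing just ("bar", 5000L), a watermark of 4000 and a gap of 1000, processBatch returns Seq("updated:bar", "timedOut:foo") and leaves only "bar" in the store, even though no data for "foo" arrived.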

This is updateStateForTimedOutKeys:

def updateStateForTimedOutKeys(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutKeys = store.filter { case (_, stateRow) =>
      val timeoutTimestamp = getTimeoutTimestamp(stateRow)
      timeoutTimestamp != NO_TIMESTAMP && timeoutTimestamp < timeoutThreshold
    }
    timingOutKeys.flatMap { case (keyRow, stateRow) =>
      callFunctionAndUpdateState(keyRow, Iterator.empty, Some(stateRow), hasTimedOut = true)
    }
  } else Iterator.empty
}

The relevant part is the flatMap over the timed-out keys, which invokes the function one last time for each of them with hasTimedOut = true.
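On the user side, the function passed to flatMapGroupsWithState can handle that final call by checking state.hasTimedOut. The following is only a sketch under assumptions: Event, SessionInfo, SessionOutput, Sessionizer, the 30-minute gap and the 10-minute watermark delay are all hypothetical, since the question does not show its own types or makeSession body. Note that EventTimeTimeout requires a watermark to be defined on the query with withWatermark.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical types for illustration; the question does not show its own.
case class Event(key: String, eventTime: Timestamp)
case class SessionInfo(numEvents: Int, lastEventTimeMs: Long)
case class SessionOutput(key: String, numEvents: Int, expired: Boolean)

object Sessionizer {
  val SessionGapMs: Long = 30 * 60 * 1000L // assumed inactivity gap

  def makeSession(
      key: String,
      events: Iterator[Event],
      state: GroupState[SessionInfo]): Iterator[SessionOutput] = {
    if (state.hasTimedOut) {
      // Final call for this key: `events` is empty, so emit and drop the state.
      val finished = state.get
      state.remove()
      Iterator(SessionOutput(key, finished.numEvents, expired = true))
    } else {
      val batch = events.toSeq
      val previous = state.getOption.getOrElse(SessionInfo(0, 0L))
      val lastEventMs = (previous.lastEventTimeMs +: batch.map(_.eventTime.getTime)).max
      val updated = SessionInfo(previous.numEvents + batch.size, lastEventMs)
      state.update(updated)
      // The key times out once the watermark passes last event time + gap.
      state.setTimeoutTimestamp(lastEventMs + SessionGapMs)
      Iterator(SessionOutput(key, updated.numEvents, expired = false))
    }
  }

  // Wiring sketch: the watermark is what drives EventTimeTimeout.
  def sessionize(events: Dataset[Event]): Dataset[SessionOutput] = {
    import events.sparkSession.implicits._
    events
      .withWatermark("eventTime", "10 minutes") // assumed delay threshold
      .groupByKey(_.key)
      .flatMapGroupsWithState[SessionInfo, SessionOutput](
        OutputMode.Update(), GroupStateTimeout.EventTimeTimeout)(makeSession _)
  }
}
```

Removing the state in the hasTimedOut branch is what frees the memory for keys such as "foo"; if the function ignores hasTimedOut and never calls remove(), the state stays in the store.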

How does Spark Structured Streaming flush in-memory state when the state data is no longer being checked?
