Question

I'm a 10-day-old Spark developer trying to understand Spark's flatMapGroupsWithState API.

As I understand it:

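We pass it two options. One is the timeout configuration; a possible value is GroupStateTimeout.ProcessingTimeTimeout, which tells Spark to base timeouts on processing time rather than event time. The other is the output mode.
We pass in a function, say myFunction, that is responsible for setting the state for each key. We also set a timeout duration with groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4)), assuming groupState is the GroupState instance for my key.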

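As I understand it, as micro-batches of streaming data keep arriving, Spark maintains the intermediate state that we define in the user-defined function. Say the intermediate state after processing n micro-batches is as follows: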

State for key1:

{
  key1: [v1, v2, v3, v4, v5]
}

State for key2:

{
   key2: [v11, v12, v13, v14, v15]
}

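For any new data that arrives, myFunction is called with the state for that particular key. For example, for key1, myFunction is called with key1, the new key1 values, and [v1,v2,v3,v4,v5], and it updates the key1 state according to its logic.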
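The setup described above could be wired together roughly as follows. This is a minimal sketch, not a reference implementation: the Event and KeyState case classes, the merge helper, the StatefulSketch object, and the use of the built-in rate source as input are all assumptions made for illustration, and handling of groupState.hasTimedOut is deliberately left out.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical input and state types, introduced only for this sketch.
case class Event(key: String, value: Long)
case class KeyState(values: List[Long])

object StatefulSketch {
  // Pure helper: append the new values of a micro-batch to the previous state.
  def merge(previous: Option[KeyState], newValues: Iterator[Event]): KeyState =
    KeyState(previous.map(_.values).getOrElse(Nil) ++ newValues.map(_.value))

  // Spark calls this once per key that has data in the current micro-batch.
  // Timeout handling (groupState.hasTimedOut) is not covered here.
  def myFunction(
      key: String,
      newValues: Iterator[Event],
      groupState: GroupState[KeyState]): Iterator[(String, Int)] = {
    val updated = merge(groupState.getOption, newValues)
    groupState.update(updated)
    // Re-arm the processing-time timeout relative to the current batch.
    groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4))
    Iterator((key, updated.values.size))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Assumption: the rate source stands in for the real stream; it yields two keys.
    val events = spark.readStream.format("rate").load()
      .selectExpr("concat('key', cast(value % 2 as string)) as key", "value")
      .as[Event]

    val sizes = events
      .groupByKey(_.key)
      .flatMapGroupsWithState[KeyState, (String, Int)](
        OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout)(myFunction)

    sizes.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}
```

As written, the state for each key only ever grows, which is exactly the situation the question below asks about.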

I read about timeouts and found this: Timeout dictates how long we should wait before timing out some intermediate state.

Questions:

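If this process runs indefinitely, my intermediate state will keep piling up and hit the memory limits on the nodes. So when is this intermediate state cleared? I found that for event-time aggregations, the watermark dictates when intermediate state is cleared.
What does timing out intermediate state mean in the context of processing time?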

Answer 1

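If this process runs indefinitely, my intermediate state will keep piling up and hit the memory limits on the nodes. So when is this intermediate state cleared? I found that for event-time aggregations, the watermark dictates when intermediate state is cleared.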

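Apache Spark will mark them as expired after the expiration time, so in your example after 4 hours of inactivity (real time + 4 hours; inactivity = no new event updating the state).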

What does timing out intermediate state mean in the context of processing time?

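It means the state will time out according to the real clock (processing time, the org.apache.spark.util.SystemClock class). You can check which clock is currently used by looking at the triggerClock parameter of org.apache.spark.sql.streaming.StreamingQueryManager#startQuery.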

You will find more details in the FlatMapGroupsWithStateExec class, especially here:

// Generate a iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
  processor.processNewData(filteredIter) ++ processor.processTimedOutState()

If you analyze these two methods, you will see that:

processNewData applies the mapping function to all active keys (those present in the micro-batch)

    /**
     * For every group, get the key, values and corresponding state and call the function,
     * and return an iterator of rows
     */
    def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
      val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
      groupedIter.flatMap { case (keyRow, valueRowIter) =>
        val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
        callFunctionAndUpdateState(
          stateManager.getState(store, keyUnsafeRow),
          valueRowIter,
          hasTimedOut = false)
      }
    }

processTimedOutState calls the mapping function on all expired states

    def processTimedOutState(): Iterator[InternalRow] = {
      if (isTimeoutEnabled) {
        val timeoutThreshold = timeoutConf match {
          case ProcessingTimeTimeout => batchTimestampMs.get
          case EventTimeTimeout => eventTimeWatermark.get
          case _ =>
            throw new IllegalStateException(
              s"Cannot filter timed out keys for $timeoutConf")
        }
        val timingOutPairs = stateManager.getAllState(store).filter { state =>
          state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
        }
        timingOutPairs.flatMap { stateData =>
          callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
        }
      } else Iterator.empty
    }
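To make the filter in processTimedOutState concrete, here is a small stand-alone simulation in plain Scala. It does not use Spark's internal classes: StoredState, timedOut, and the NoTimestamp value of -1 are stand-ins assumed for this sketch, and the timestamps are made-up numbers.

```scala
object TimeoutFilterSketch {
  // Stand-in for Spark's internal "no timeout set" sentinel; the value is an assumption.
  val NoTimestamp: Long = -1L

  // Hypothetical, simplified view of one row in the state store.
  final case class StoredState(key: String, timeoutTimestamp: Long)

  // Same predicate as in processTimedOutState: a timeout must be set
  // and must lie strictly before the threshold.
  def timedOut(states: Seq[StoredState], timeoutThreshold: Long): Seq[StoredState] =
    states.filter { s =>
      s.timeoutTimestamp != NoTimestamp && s.timeoutTimestamp < timeoutThreshold
    }

  def main(args: Array[String]): Unit = {
    val fourHoursMs = 4L * 60 * 60 * 1000
    val states = Seq(
      StoredState("key1", 1000L + fourHoursMs),    // last updated at t = 1000 ms
      StoredState("key2", 9000000L + fourHoursMs), // updated much later
      StoredState("key3", NoTimestamp))            // no timeout armed
    // With ProcessingTimeTimeout the threshold is the batch timestamp.
    val batchTimestampMs = 1000L + fourHoursMs + 1
    println(timedOut(states, batchTimestampMs).map(_.key))
  }
}
```

With these numbers only key1 is selected: key2 has not yet been inactive for 4 hours, and key3 never had a timeout set.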

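The important thing to notice here is that Apache Spark will keep expired state in the state store if you don't call the GroupState#remove method. Expired states won't be returned for processing, though, because they are flagged with the NO_TIMESTAMP field. However, they will still be stored in the state store delta files, which can slow down reprocessing if you need to reload the most recent state. If you analyze FlatMapGroupsWithStateExec again, you will see that the state is removed only when the state's removed flag is set to true: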

def callFunctionAndUpdateState(...)
  // ...
  // When the iterator is consumed, then write changes to state
  def onIteratorCompletion: Unit = {
    if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
      stateManager.removeState(store, stateData.keyRow)
      numUpdatedStateRows += 1
    } else {
      val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
      val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
      val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged

      if (shouldWriteState) {
        val updatedStateObj = if (groupState.exists) groupState.get else null
        stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
        numUpdatedStateRows += 1
      }
    }
  }
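In user code, this means the mapping function itself has to call remove() when it is invoked for a timed-out key. A minimal sketch of such a function, assuming a hypothetical BufferedValues state type and a TimeoutAwareFunction object that exist only for this illustration:

```scala
import org.apache.spark.sql.streaming.GroupState

// Hypothetical state type, introduced only for this sketch.
case class BufferedValues(values: List[Long])

object TimeoutAwareFunction {
  def myFunction(
      key: String,
      newValues: Iterator[Long],
      groupState: GroupState[BufferedValues]): Iterator[(String, Int)] = {
    if (groupState.hasTimedOut) {
      // Invoked with no new values because the key expired:
      // emit a final result and drop the state so it leaves the state store.
      val finalSize = groupState.getOption.map(_.values.size).getOrElse(0)
      groupState.remove()
      Iterator((key, finalSize))
    } else {
      // Regular invocation: extend the state and re-arm the timeout.
      val previous = groupState.getOption.map(_.values).getOrElse(Nil)
      groupState.update(BufferedValues(previous ++ newValues))
      groupState.setTimeoutDuration("4 hours")
      Iterator.empty
    }
  }
}
```

Whether to emit a final row on timeout or just remove the state silently is a design choice; the sketch emits one so the expiry is visible downstream.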

Spark arbitrary stateful streaming aggregation, flatMapGroupsWithState API
