Spark Streaming checkpointing for DStreams

Date: 2015-12-31 18:33:19

Tags: apache-spark spark-streaming checkpointing

In Spark Streaming it is possible (and mandatory when stateful operations are used) to set the StreamingContext to checkpoint to reliable data storage (S3, HDFS, ...) both of (AND):

  • Metadata
  • DStream lineage

As described here, to set the output data store you need to call yourSparkStreamingCtx.checkpoint(datastoreURL).
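As a minimal sketch of that setup (the application name, batch interval and checkpoint URL below are illustrative, not taken from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative setup; any reliable, fault-tolerant store (HDFS, S3, ...) works
val conf = new SparkConf().setAppName("CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(10))   // 10s batch interval (example value)

// Enables metadata checkpointing, and lineage (RDD) checkpointing
// for those DStreams that require it
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // example URL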

On the other hand, a lineage checkpoint interval can be set for each DStream by calling checkpoint(timeInterval). In fact, the recommendation is to set the lineage checkpoint interval to 5 to 10 times the DStream's sliding interval:

    dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
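A sketch of what a per-DStream interval looks like, reusing the ssc from the sketch above (the source and the values are illustrative):

// Hypothetical receiver-based source; with the 10s sliding interval above,
// a 50s checkpoint interval falls in the recommended 5-10x range
val lines = ssc.socketTextStream("localhost", 9999)
lines.checkpoint(Seconds(50))   // must be called before ssc.start()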

My question is:

When the streaming context is set to perform checkpointing and no ds.checkpoint(interval) is called, is lineage checkpointing enabled for all DStreams, with a default checkpointInterval equal to the batch interval? Or is, on the contrary, only metadata checkpointing enabled?

1 Answer:

Answer 0 (score: 11)

Checking the Spark code (v1.5), I found that DStreams' checkpointing is enabled in two circumstances:

By an explicit call to their checkpoint method (not the StreamingContext's):

/**
* Enable periodic checkpointing of RDDs of this DStream
* @param interval Time interval after which generated RDD will be checkpointed
*/
def checkpoint(interval: Duration): DStream[T] = {
    if (isInitialized) {
        throw new UnsupportedOperationException(
            "Cannot change checkpoint interval of an DStream after streaming context has started")
    }
    persist()
    checkpointDuration = interval
    this
}
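Note the isInitialized guard: a per-DStream interval can only be set before the streaming context is started (dstream and ssc below stand for any DStream and StreamingContext defined earlier):

dstream.checkpoint(Seconds(50))      // fine: the context has not started yet
ssc.start()
// dstream.checkpoint(Seconds(100))  // would now throw UnsupportedOperationException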

Whenever the concrete DStream subclass has overridden the mustCheckpoint attribute (setting it to true).

The first case is obvious. For the second, the DStream initialization code sets a default checkpoint interval automatically when mustCheckpoint is true:

 private[streaming] def initialize(time: Time) {
  ...
  ...   
   // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
   if (mustCheckpoint && checkpointDuration == null) {
     checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
     logInfo("Checkpoint interval automatically set to " + checkpointDuration)
   }
  ...

To find out which subclasses set that flag, a naive grep over the Spark Streaming sources is enough:

grep "val mustCheckpoint = true" $(find -type f -name "*.scala")

./org/apache/spark/streaming/api/python/PythonDStream.scala:    override val mustCheckpoint = true
./org/apache/spark/streaming/dstream/ReducedWindowedDStream.scala:    override val mustCheckpoint = true
./org/apache/spark/streaming/dstream/StateDStream.scala:    override val mustCheckpoint = true

So, in general (ignoring PythonDStream), enabling checkpointing on the StreamingContext only enables lineage checkpointing for StateDStream and ReducedWindowedDStream instances. These instances are the result of the transformations (respectively, AND):

  • updateStateByKey: that is, a stream that carries state across several windows.
  • reduceByKeyAndWindow
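As a hedged end-to-end sketch (source, state function and checkpoint path are illustrative): once the context-level checkpoint directory is set, a pipeline like the one below gets lineage checkpointing on its StateDStream automatically, with the default interval computed above, even though dstream.checkpoint(...) is never called explicitly.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("StatefulExample"), Seconds(10))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // mandatory for stateful operations

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// updateStateByKey yields a StateDStream, whose mustCheckpoint = true,
// so its lineage is checkpointed even without an explicit checkpoint(interval) call
val runningCounts = words.map((_, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}

runningCounts.print()
ssc.start()
ssc.awaitTermination()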