In Spark Streaming it is possible (and mandatory if you are going to use stateful operations) to set the StreamingContext to perform checkpointing of both metadata and DStream lineage into a reliable data store (S3, HDFS, ...).

As described above, to set the output data store you need to call yourSparkStreamingCtx.checkpoint(datastoreURL).
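For instance, a minimal sketch of enabling context-level checkpointing (the batch interval and the HDFS path below are illustrative placeholders, not values from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: enable metadata/lineage checkpointing for the whole streaming context.
val conf = new SparkConf().setAppName("checkpoint-demo")
val ssc = new StreamingContext(conf, Seconds(10))   // illustrative batch interval
ssc.checkpoint("hdfs:///tmp/spark-checkpoints")     // illustrative directory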
On the other hand, a lineage checkpoint interval can be set for each DStream by calling checkpoint(timeInterval) on it. In fact, it is recommended to set the lineage checkpoint interval to 5 to 10 times the DStream's sliding interval:

dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
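As a sketch of that recommendation (reusing the ssc from the sketch above; the host, port and durations are made-up values), a stream sliding every 10 seconds would get a checkpoint interval of 50 to 100 seconds:

// Sketch only: all values are illustrative.
val lines = ssc.socketTextStream("localhost", 9999)
val windowed = lines.window(Seconds(30), Seconds(10))   // slideDuration = 10s
windowed.checkpoint(Seconds(50))                        // 5 x the sliding interval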
My question is:

When the streaming context has been set to perform checkpointing and no ds.checkpoint(interval) is called, is lineage checkpointing enabled for all DStreams, with a default checkpointInterval equal to the batch interval? Or is, on the contrary, only metadata checkpointing enabled?
Answer (score: 11):
Checking the Spark code (v1.5), I found that a DStream's checkpointing is enabled in two cases:

1. By an explicit call to its checkpoint method (not the StreamingContext's):
/**
 * Enable periodic checkpointing of RDDs of this DStream
 * @param interval Time interval after which generated RDD will be checkpointed
 */
def checkpoint(interval: Duration): DStream[T] = {
  if (isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change checkpoint interval of an DStream after streaming context has started")
  }
  persist()
  checkpointDuration = interval
  this
}
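A brief sketch of the constraint this guard enforces (the stream and interval below are illustrative, reusing the ssc from above): the per-DStream checkpoint interval has to be set before the streaming context is started.

// Sketch only: illustrative stream and interval.
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.checkpoint(Seconds(30))    // OK: the context has not been started yet
ssc.start()
// words.checkpoint(Seconds(60)) // would throw the UnsupportedOperationException above
ssc.awaitTermination()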
2. Whenever the concrete DStream subclass has overridden the mustCheckpoint attribute (setting it to true):
private[streaming] def initialize(time: Time) {
  ...
  ...
  // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
  if (mustCheckpoint && checkpointDuration == null) {
    checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
    logInfo("Checkpoint interval automatically set to " + checkpointDuration)
  }
  ...
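A quick illustration of that formula (the slide durations below are made-up examples): the automatically chosen interval is the slide duration rounded up to roughly 10 seconds or more.

import org.apache.spark.streaming.{Duration, Seconds}

// Sketch: reproduces the formula above for two illustrative slide durations.
def autoCheckpointInterval(slideDuration: Duration): Duration =
  slideDuration * math.ceil(Seconds(10) / slideDuration).toInt

autoCheckpointInterval(Seconds(4))   // 4s  * ceil(10/4)  = 12 seconds
autoCheckpointInterval(Seconds(30))  // 30s * ceil(10/30) = 30 seconds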
The first case is obvious. Performing a naive analysis of the Spark Streaming code:

grep "val mustCheckpoint = true" $(find -type f -name "*.scala")
> ./org/apache/spark/streaming/api/python/PythonDStream.scala: override val mustCheckpoint = true
> ./org/apache/spark/streaming/dstream/ReducedWindowedDStream.scala: override val mustCheckpoint = true
> ./org/apache/spark/streaming/dstream/StateDStream.scala: override val mustCheckpoint = true
I can see that, in general (ignoring PythonDStream), StreamingContext checkpointing only enables lineage checkpointing for ReducedWindowedDStream and StateDStream instances. These instances are the result of the transformations reduceByKeyAndWindow (with an inverse reduce function) and updateStateByKey, respectively.
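To make that concrete, a hedged sketch of those two transformations (stream names, functions and durations are illustrative, reusing the ssc from the sketches above), whose lineage gets checkpointed automatically once ssc.checkpoint(dir) has been set:

// Sketch only: all values are illustrative.
val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Produces a StateDStream (mustCheckpoint = true)
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

// Produces a ReducedWindowedDStream (mustCheckpoint = true), because an inverse
// reduce function is supplied
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))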