Changing configuration mid-stream with a check-pointed Spark Stream

Date: 2016-04-25 05:39:38

Tags: apache-spark spark-streaming

I have a Spark Streaming / DStream application like this:

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

My context is set up using a configuration file, from which I can pull items with methods like appConf.getString. So what I actually use is:

val context = StreamingContext.getOrCreate(
    appConf.getString("spark.checkpointDirectory"), 
    () => createStreamContext(sparkConf, appConf))

where val sparkConf = new SparkConf()...
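
For reference, a trimmed-down sketch of my createStreamContext (assume appConf is a Typesafe Config; apart from spark.windowDurationSecs and spark.checkpointDirectory, the key names and the socket source are just illustrative):

import com.typesafe.config.Config
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createStreamContext(sparkConf: SparkConf, appConf: Config): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(appConf.getLong("spark.batchDurationSecs")))
  val lines = ssc.socketTextStream(appConf.getString("spark.host"), appConf.getInt("spark.port"))
  // The window duration comes from the app config file
  lines.window(Seconds(appConf.getLong("spark.windowDurationSecs"))).print()
  ssc.checkpoint(appConf.getString("spark.checkpointDirectory"))
  ssc
}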

If I stop my app and change the configuration in the app file, these changes are not picked up unless I delete the checkpoint directory contents. For example, I would like to change spark.streaming.kafka.maxRatePerPartition or spark.windowDurationSecs dynamically. (Edit: I kill the app, change the configuration file, and then restart the app.) How can I change these settings dynamically, or otherwise enforce a configuration change, without trashing my checkpoint directory (which is about to include checkpoints of state information)?

3 Answers:

Answer 0 (score: 2)

Did you create your streaming context the way the documentation suggests, by using StreamingContext.getOrCreate, which takes the previous checkpointDirectory as an argument?

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
    val ssc = new StreamingContext(...)   // new context
    val lines = ssc.socketTextStream(...) // create DStreams
    ...
    ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
    ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

Answer 1 (score: 2)

"How can I dynamically change these settings or enforce a configuration change without trashing my checkpoint directory?"

If you dig into the code of StreamingContext.getOrCreate:

def getOrCreate(
    checkpointPath: String,
    creatingFunc: () => StreamingContext,
    hadoopConf: Configuration = SparkHadoopUtil.get.conf,
    createOnError: Boolean = false
  ): StreamingContext = {
    val checkpointOption = CheckpointReader.read(
      checkpointPath, new SparkConf(), hadoopConf, createOnError)
    checkpointOption.map(new StreamingContext(null, _, null)).getOrElse(creatingFunc())
}

You can see that if CheckpointReader has checkpoint data at the given path, it is read with new SparkConf() as the parameter, since this overload does not allow passing a custom-built SparkConf. By default, SparkConf will load any settings declared as system properties or provided on the classpath:

class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */
  def this() = this(true)
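
To illustrate the point (a small sketch of my own, not part of the Spark source above): any JVM system property prefixed with spark. — for example one set with spark-submit's --conf — is picked up by a default-constructed SparkConf:

import org.apache.spark.SparkConf

// The no-arg constructor (loadDefaults = true) reads every system
// property that starts with "spark." into the new SparkConf.
System.setProperty("spark.streaming.kafka.maxRatePerPartition", "100")
val conf = new SparkConf()
println(conf.get("spark.streaming.kafka.maxRatePerPartition")) // "100"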

So one way of achieving what you want is, instead of creating a SparkConf object in code, to pass the parameters to spark-submit via spark.driver.extraClassPath and spark.executor.extraClassPath:
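
For example (a sketch with placeholder paths and class name; --class and --conf are standard spark-submit options), putting the directory containing the updated config file on the classpath at submit time:

spark-submit \
  --class com.example.MyStreamingApp \
  --conf spark.driver.extraClassPath=/path/to/conf-dir \
  --conf spark.executor.extraClassPath=/path/to/conf-dir \
  my-streaming-app.jar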

Answer 2 (score: 1)

Spark configuration cannot be added or updated when restoring from a checkpoint directory. You can find Spark's checkpointing behavior in the documentation:

"When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start(). When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory."

So if you use a checkpoint directory, then on restarting the job it will re-create the StreamingContext from the checkpoint data, and that data holds the old sparkConf.
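
A small sketch of my own (assuming checkpointDirectory is defined as above) that makes this visible: the creating function only runs when no usable checkpoint exists, so configuration changes inside it are silently ignored on a restore:

import org.apache.spark.streaming.StreamingContext

var newContextCreated = false

def createContext(): StreamingContext = {
  newContextCreated = true // reached only when no checkpoint was found
  ??? // build, checkpoint, and return the context as shown above
}

val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
if (newContextCreated)
  println("Fresh context: the current configuration applies")
else
  println("Restored from checkpoint: the old SparkConf is in effect")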