Spark Structured Streaming consistency across sinks

Time: 2017-11-07 13:52:59

Tags: apache-spark spark-streaming

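I would like to better understand the consistency model of Spark 2.2 Structured Streaming in the following case:

  • One source (Kinesis)
  • Two queries from this source towards two different sinks: one file sink for archival purposes (S3), and another sink for the processed data (DB or file, not yet decided)

I would like to understand whether there is any consistency guarantee across sinks, at least in certain circumstances:

  • Can one of the sinks be way ahead of the other? Or do they consume data from the source at the same speed (since it is the same source)? Can they be kept in sync?
  • If I (gracefully) stop the streaming application, will the data in the two sinks be consistent?

The reason is that I want to build a Kappa-like processing application that can suspend or shut down the streaming part when I want to reprocess some history. When I resume streaming, I want to avoid reprocessing data that has already been processed (because it is part of the history), and avoid missing data (for example, data that had not yet been committed to the archive but is skipped as already processed when streaming resumes).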

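2 answers:

Answer 0: (score: 7)

One important thing to keep in mind is that the two sinks are driven by two distinct queries, each reading from the source independently. Checkpointing is therefore done per query.

Every call to start on a DataStreamWriter creates a new query, and if you set checkpointLocation, each query gets its own checkpoint to track its offsets.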

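// Needed for the 'colA symbol-to-Column syntax used below
import spark.implicits._

val input = spark.readStream....

// Query 1: Parquet file sink with its own checkpoint directory
val query1 = input.select('colA, 'colB)
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir1")
  .start("/path1")

// Query 2: CSV file sink with a separate checkpoint directory
val query2 = input.select('colA, 'colB)
  .writeStream
  .format("csv")
  .option("checkpointLocation", "path/to/checkpoint/dir2")
  .start("/path2")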

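So each query reads from the source and tracks its offsets independently. This also means that the queries can be at different offsets in the input stream, and that you can restart either one, or both, without affecting the other.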
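To see how far apart two queries are at a given moment, you can compare the progress each one reports. The following is a minimal sketch, assuming the queries were started on the same SparkSession; the helper name printOffsets is hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: print the latest batch id and source end offsets
// reported by every active streaming query on this session.
def printOffsets(spark: SparkSession): Unit = {
  spark.streams.active.foreach { q =>
    // lastProgress is null until the query has reported at least one progress update
    Option(q.lastProgress) match {
      case Some(p) =>
        p.sources.foreach { s =>
          println(s"query=${q.id} batch=${p.batchId} source=${s.description} endOffset=${s.endOffset}")
        }
      case None =>
        println(s"query=${q.id} has not reported progress yet")
    }
  }
}
```

Calling printOffsets(spark) periodically from the driver should show the end offsets of the two queries drifting apart and catching up, since nothing ties their batches together.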

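Answer 1: (score: 3)

What Silvio wrote is absolutely correct. Writing to two sinks starts two streaming queries that run independently of each other (in effect, two streaming applications that read the same data twice, process it twice, and checkpoint on their own).

I would like to add that if you want both queries to stop or pause together whenever either one is restarted or fails, you can use the API call awaitAnyTermination(), which returns (or throws) as soon as any one query terminates.

Instead of using: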

query.start().awaitTermination()

use:

sparkSession.streams.awaitAnyTermination()

Here is an excerpt from the API docs:

/**
   * Wait until any of the queries on the associated SQLContext has terminated since the
   * creation of the context, or since `resetTerminated()` was called. If any query was terminated
   * with an exception, then the exception will be thrown.
   *
   * If a query has terminated, then subsequent calls to `awaitAnyTermination()` will either
   * return immediately (if the query was terminated by `query.stop()`),
   * or throw the exception immediately (if the query was terminated with exception). Use
   * `resetTerminated()` to clear past terminations and wait for new terminations.
   *
   * In the case where multiple queries have terminated since `resetTermination()` was called,
   * if any query has terminated with exception, then `awaitAnyTermination()` will
   * throw any of the exception. For correctly documenting exceptions across multiple queries,
   * users need to stop all of them after any of them terminates with exception, and then check the
   * `query.exception()` for each query.
   *
   * @throws StreamingQueryException if any query has terminated with an exception
   *
   * @since 2.0.0
   */
  @throws[StreamingQueryException]
  def awaitAnyTermination(): Unit = {
    awaitTerminationLock.synchronized {
      while (lastTerminatedQuery == null) {
        awaitTerminationLock.wait(10)
      }
      if (lastTerminatedQuery != null && lastTerminatedQuery.exception.nonEmpty) {
        throw lastTerminatedQuery.exception.get
      }
    }
  }
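
Following the advice in the doc comment above, a driver can stop every query once any of them terminates and then inspect each query's exception. This is a rough sketch, not a definitive pattern; the function name runUntilAnyStops is hypothetical, and it assumes all queries have already been started on the given session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryException

// Hypothetical helper: block until any query terminates, then stop the rest.
def runUntilAnyStops(spark: SparkSession): Unit = {
  // Capture the handles up front, because terminated queries
  // no longer appear in spark.streams.active.
  val queries = spark.streams.active
  try {
    spark.streams.awaitAnyTermination()
  } catch {
    case e: StreamingQueryException =>
      println(s"A query failed: ${e.getMessage}")
  } finally {
    // Stop every query, including the ones that are still running.
    queries.foreach(_.stop())
    // Report the failure cause of each query, if it had one.
    queries.foreach { q =>
      q.exception.foreach(ex => println(s"query ${q.id} failed: ${ex.getMessage}"))
    }
  }
}
```

Note that this only stops the queries at roughly the same time. Each query still has its own checkpoint, so the two sinks may have been written up to different source offsets when they stop.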