Where is the WAL located in Spark Structured Streaming?

Date: 2020-02-24 23:45:26

Tags: apache-spark spark-streaming spark-structured-streaming

I have enabled the WAL for my Structured Streaming application. Where can I find the location of the WAL logs? For my Spark Streaming (DStream) jobs I can see the WAL under the receivedBlockMetadata prefix, but I don't see any such prefix created for Structured Streaming.

2 answers:

Answer 0 (score: 1)

As I understand it, the receiver-based WAL applies only to Spark Streaming (DStreams), not to Structured Streaming. Structured Streaming achieves fault tolerance through checkpointing, conceptually similar to Flink's global state snapshots. The checkpoint stores all state, including Kafka offsets, and its location is specified in your code via the checkpointLocation option.
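As a minimal sketch of where that location is set (the Kafka broker, topic, and paths below are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()

// Read a stream from Kafka (assumed broker/topic for illustration).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// The checkpointLocation option is where Structured Streaming keeps its
// fault-tolerance state: the offset log, commit log, and any operator state.
val query = events.writeStream
  .format("parquet")
  .option("path", "/data/out/events")
  .option("checkpointLocation", "/data/checkpoints/events-query")
  .start()
```

If you never set checkpointLocation explicitly, a sink that requires it will fail to start unless spark.sql.streaming.checkpointLocation is configured as a default root directory.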

Answer 1 (score: 0)

Spark Structured Streaming no longer writes a WAL containing every message received, as the receiver-based DStream API did. Instead, each batch produces only two metadata logs: an offset log and a commit log. You can find the implementation details in org.apache.spark.sql.execution.streaming.StreamExecution:

  /**
   * A write-ahead-log that records the offsets that are present in each batch. In order to ensure
   * that a given batch will always consist of the same data, we write to this log *before* any
   * processing is done.  Thus, the Nth record in this log indicated data that is currently being
   * processed and the N-1th entry indicates which offsets have been durably committed to the sink.
   */
  val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))

  /**
   * A log that records the batch ids that have completed. This is used to check if a batch was
   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
  val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))

Both logs live under the checkpointLocation, in the offsets and commits folders respectively. In Structured Streaming these logs contain only offset metadata, not the data itself.
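To make this concrete, a checkpoint directory for a running query typically looks something like the following (paths, batch ids, and the offsets-file contents are illustrative; the exact layout can vary between Spark versions):

```
/data/checkpoints/events-query/
├── metadata        # query id
├── offsets/        # the offset log: one file per batch, written BEFORE processing
│   ├── 0
│   └── 1
├── commits/        # the commit log: one file per batch, written AFTER the sink commits
│   ├── 0
│   └── 1
└── sources/        # per-source initial metadata

# An offsets/<batchId> file holds only metadata, e.g. for a Kafka source:
#   v1
#   {"batchWatermarkMs":0,"batchTimestampMs":1582558026000,"conf":{...}}
#   {"events":{"0":1500}}
```

On restart, Spark compares the latest entry in offsets/ with the latest entry in commits/: if a batch has an offset entry but no commit entry, that batch is re-run from the recorded offsets, which is what gives Structured Streaming its recovery guarantee without logging the data itself.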