Where is the WAL located in Spark Structured Streaming?

Date: 2020-02-24 23:45:26

Tags: apache-spark spark-streaming spark-structured-streaming

I have enabled the WAL for my Structured Streaming application. Where can I find the location of the WAL logs? For my Spark Streaming (DStream) jobs I can see the WAL under the receivedBlockMetadata prefix, but I don't see any such prefix created for Structured Streaming.

2 answers:

Answer 0 (score: 1)

As I understand it, the receiver-based WAL applies only to Spark Streaming (DStreams), not to Structured Streaming. Structured Streaming achieves fault tolerance through checkpointing, conceptually similar to Flink's global state snapshots. The checkpoint stores all state, including Kafka offsets, and its location is specified in your code via the checkpointLocation option.
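As a minimal sketch of where that location is set (the Kafka broker, topic, and paths below are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()

// Read a stream from Kafka (assumed broker/topic for illustration).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// The checkpointLocation option is where Structured Streaming keeps its
// fault-tolerance state: the offset log, commit log, and any operator state.
val query = events.writeStream
  .format("parquet")
  .option("path", "/data/out/events")
  .option("checkpointLocation", "/data/checkpoints/events-query")
  .start()
```

If you never set checkpointLocation explicitly, a sink that requires it will fail to start unless spark.sql.streaming.checkpointLocation is configured as a default root directory.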

Answer 1 (score: 0)

Spark Structured Streaming no longer writes a WAL containing every message received, as the receiver-based DStream API did. Instead, each batch produces only two metadata logs: an offset log and a commit log. You can find the implementation details in org.apache.spark.sql.execution.streaming.StreamExecution:

  /**
   * A write-ahead-log that records the offsets that are present in each batch. In order to ensure
   * that a given batch will always consist of the same data, we write to this log *before* any
   * processing is done.  Thus, the Nth record in this log indicated data that is currently being
   * processed and the N-1th entry indicates which offsets have been durably committed to the sink.
   */
  val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))

  /**
   * A log that records the batch ids that have completed. This is used to check if a batch was
   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
  val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))

Both logs live under the checkpointLocation, in the offsets and commits folders respectively. In Structured Streaming these logs contain only offset metadata, not the data itself.
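To make this concrete, a checkpoint directory for a running query typically looks something like the following (paths, batch ids, and the offsets-file contents are illustrative; the exact layout can vary between Spark versions):

```
/data/checkpoints/events-query/
├── metadata        # query id
├── offsets/        # the offset log: one file per batch, written BEFORE processing
│   ├── 0
│   └── 1
├── commits/        # the commit log: one file per batch, written AFTER the sink commits
│   ├── 0
│   └── 1
└── sources/        # per-source initial metadata

# An offsets/<batchId> file holds only metadata, e.g. for a Kafka source:
#   v1
#   {"batchWatermarkMs":0,"batchTimestampMs":1582558026000,"conf":{...}}
#   {"events":{"0":1500}}
```

On restart, Spark compares the latest entry in offsets/ with the latest entry in commits/: if a batch has an offset entry but no commit entry, that batch is re-run from the recorded offsets, which is what gives Structured Streaming its recovery guarantee without logging the data itself.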