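I have enabled the WAL for my Structured Streaming application. Where can I find the WAL logs? For my Spark Streaming job I can see the WAL under the prefix receivedBlockMetadata, but I don't see any prefix created for Structured Streaming.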
Answer 0: (score: 1)
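As I understand it, the WAL applies only to Spark Streaming, not to Structured Streaming. Structured Streaming achieves fault tolerance through checkpointing, similar to Flink's global state. The checkpoint stores all state, including Kafka offsets and any other state. Its location is specified in your code.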
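For illustration, the checkpoint location is normally set with the `checkpointLocation` option on the stream writer. This is only a minimal sketch, assuming a local Spark session with the spark-sql dependency on the classpath. The built-in `rate` source stands in for Kafka, and the path `/tmp/structured-streaming-checkpoint` is a hypothetical placeholder.

```scala
import org.apache.spark.sql.SparkSession

object CheckpointLocationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-location-sketch")
      .master("local[2]")
      .getOrCreate()

    // The built-in "rate" source generates rows continuously; it replaces Kafka in this sketch.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    val query = input.writeStream
      .format("console")
      // Hypothetical path; the query keeps its checkpoint data under this directory.
      .option("checkpointLocation", "/tmp/structured-streaming-checkpoint")
      .start()

    // Let the query run for roughly 10 seconds, then shut everything down.
    query.awaitTermination(10000L)
    query.stop()
    spark.stop()
  }
}
```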
Answer 1: (score: 0)
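Spark Structured Streaming does not keep a WAL containing all the messages from the receiver.
There are only two logs, each holding per-batch metadata: the offset log and the commit log.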
You can find the implementation details in org.apache.spark.sql.execution.streaming.StreamExecution:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
Both logs are stored under checkpointLocation, in the offsets and commits folders.
In Structured Streaming, these logs contain only offset information, not the received data itself.
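To check this on disk, something like the following lists the batch ids recorded in each folder. It is a rough sketch that assumes the checkpoint lives on a local filesystem (for HDFS or S3 you would use the Hadoop FileSystem API or the storage CLI instead) and that each batch is written as a file named after its batch id. `CheckpointLogs`, `listBatchIds`, and the default path are hypothetical names, not part of Spark.

```scala
import java.nio.file.{Path, Paths}
import scala.util.Try

object CheckpointLogs {
  /** Returns the sorted batch ids found in one log folder ("offsets" or "commits"). */
  def listBatchIds(checkpointDir: Path, logName: String): Seq[Long] = {
    // listFiles() returns null when the folder does not exist, so wrap it in Option.
    val files = Option(checkpointDir.resolve(logName).toFile.listFiles())
      .getOrElse(Array.empty[java.io.File])
    // Keep only files whose name is a plain number; this skips .crc and temp files.
    files.toSeq.map(_.getName).flatMap(n => Try(n.toLong).toOption).sorted
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical default path; pass the real checkpointLocation as the first argument.
    val checkpoint = Paths.get(args.headOption.getOrElse("/tmp/structured-streaming-checkpoint"))
    println(s"offsets: ${listBatchIds(checkpoint, "offsets")}")
    println(s"commits: ${listBatchIds(checkpoint, "commits")}")
  }
}
```

If a batch id shows up under offsets but not under commits, that batch was started but never marked complete, which matches the comments in the StreamExecution snippet above.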