Spark Streaming fails due to an error on a Kafka topic different from the one being read

Date: 2019-07-14 20:29:36

Tags: apache-spark apache-kafka spark-structured-streaming

For the following write-topic/read-topic pair running in tandem on the topic air2008rand:

import org.apache.spark.sql.streaming.Trigger

(spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("subscribe", "air2008rand")
  .load()
  .groupBy('value.cast("string").as('key))
  .agg(count("*").cast("string") as 'value)
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("includeTimestamp", true)
  .option("topic", "t1")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode("update")
  .option("checkpointLocation", "/tmp/cp")
  .start)

An error is produced that refers to a different topic:

air2008m1-0

This behavior can be reproduced by stopping the read/write code (at the spark-shell scala> prompt) and then re-running it:

    19/07/14 13:27:22 ERROR MicroBatchExecution: Query [id = 711d44b2-3224-4493-8677-e5c8cc4f3db4, runId = 68a3519a-e9cf-4a82-9d96-99be833227c0] terminated with error
    java.lang.IllegalStateException: Set(air2008m1-0) are gone. Some data may have been missed.
    Some data may have been lost because they are not available in Kafka any more; either the
    data was aged out by Kafka or the topic may have been deleted before all the data in the
    topic was processed. If you don't want your streaming query to fail on such cases, set the
    source option "failOnDataLoss" to "false".
        at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.org$apache$spark$sql$kafka010$KafkaMicroBatchReader$$reportDataLoss(KafkaMicroBatchReader.scala:261)
        at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.planInputPartitions(KafkaMicroBatchReader.scala:124)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions$lzycompute(DataSourceV2ScanExec.scala:76)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions(DataSourceV2ScanExec.scala:75)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:65)
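As an aside, the exception message itself names a workaround: the Kafka source option "failOnDataLoss". A minimal sketch of applying it to the reader above (note this only suppresses the failure and silently tolerates missing offsets; it does not fix the underlying mismatch):

    // Sketch: same Kafka reader as above, but with failOnDataLoss disabled
    // so the query logs a warning instead of terminating when recorded
    // offsets are no longer available.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "air2008rand")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false")  // warn instead of failing the query
      .load()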

Why is there "crosstalk" between different Kafka topics here?

1 Answer:

Answer 0 (score: 1):

The problem is that the checkpoint directory contains data from an earlier Spark streaming operation (against the topic air2008m1, whose partition appears in the error). The resolution is to change the checkpoint directory.
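A minimal sketch of the fix, reusing the query from the question: point checkpointLocation at a directory not shared with any earlier query ("/tmp/cp2" here is an arbitrary fresh path, not from the original post):

    // Sketch: same pipeline as in the question, but with a NEW checkpoint
    // directory, so Spark does not try to resume from offsets recorded for
    // a different topic (air2008m1-0) by an earlier streaming query.
    import org.apache.spark.sql.streaming.Trigger

    (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("startingOffsets", "earliest")
      .option("subscribe", "air2008rand")
      .load()
      .groupBy('value.cast("string").as('key))
      .agg(count("*").cast("string") as 'value)
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "t1")
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .outputMode("update")
      .option("checkpointLocation", "/tmp/cp2")  // fresh, unshared path
      .start)

Alternatively, deleting the stale contents of /tmp/cp has the same effect, at the cost of discarding that query's progress.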

The solution was found as a comment (from @jaceklaskowski himself) on this question: [IllegalStateException]: Spark Structured Streaming is termination Streaming Query with Error.