Structured Streaming gets the wrong current offset from Kafka

Time: 2019-01-24 19:20:49

Tags: apache-spark spark-structured-streaming

When running Spark Structured Streaming with the lib "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0", we keep getting the following error about fetching the current offset:


  Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): java.lang.AssertionError: assertion failed: latest offset -9223372036854775808 does not equal -1
      at scala.Predef$.assert(Predef.scala:170)
      at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
      at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
      at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
      at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
      at org.apache.spark.scheduler.Task.run(Task.scala:121)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

For some reason, it looks like fetchLatestOffset returned Long.MIN_VALUE for one of the partitions. I checked the Structured Streaming checkpoint and it was correct; it was the currentAvailableOffset that was set to Long.MIN_VALUE.

Kafka broker version: 1.1.0. The lib we use:

    libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"

How to reproduce: basically we started a Structured Streaming query subscribed to a topic with 4 partitions, then sent some messages into the topic; the job crashed and logged the stacktrace above. A minimal sketch of this setup follows below.
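For reference, a minimal sketch of such a job, assuming the spark-sql-kafka-0-10 dependency above; the broker address, checkpoint path, and console sink are placeholders, not the original job, and REVENUEEVENT is the topic name that appears in the logs below:

    import org.apache.spark.sql.SparkSession

    object KafkaOffsetRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-offset-repro")
          .getOrCreate()

        // Subscribe to the 4-partition topic (name taken from the logs below).
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092") // placeholder
          .option("subscribe", "REVENUEEVENT")
          .option("startingOffsets", "latest")
          .load()

        // Any sink works for reproducing the crash once new messages arrive;
        // console is just the simplest.
        val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/kafka-offset-repro") // placeholder
          .start()

        query.awaitTermination()
      }
    }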

The committed offsets also look fine, as we can see in the logs:

=== Streaming Query ===
Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 31878627-d473-4ee8-955d-d4d3f3f45eb9]
Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: {"REVENUEEVENT":{"0":1}}}
Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: {"REVENUEEVENT":{"0":-9223372036854775808}}}

So Spark Streaming recorded the correct committed value for the partition (0), but the current available offset returned from Kafka shows Long.MIN_VALUE.
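One way to watch what the driver sees on each trigger (not part of the original post) is to poll StreamingQuery.lastProgress, whose per-source startOffset/endOffset carry the same JSON as the log block above; this assumes `query` is the running query from the sketch earlier:

    import org.apache.spark.sql.streaming.StreamingQuery

    def logOffsets(query: StreamingQuery): Unit = {
      val progress = query.lastProgress // null until the first trigger completes
      if (progress != null) {
        progress.sources.foreach { s =>
          println(s"source=${s.description} start=${s.startOffset} end=${s.endOffset}")
        }
      }
    }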

1 Answer:

Answer 0 (score: 0)

Found the issue: it was caused by an integer overflow inside the Spark Structured Streaming library. Details are posted here: https://issues.apache.org/jira/browse/SPARK-26718
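As a side note (not the actual Spark code, just an illustration of the wraparound mechanism behind such an overflow): Long arithmetic in Scala silently wraps instead of throwing, so an offset computation that exceeds Long.MaxValue lands near Long.MinValue, which matches the -9223372036854775808 seen in the logs.

    val nearMax = Long.MaxValue - 10L
    val overflowed = nearMax + 100L // wraps past Long.MaxValue
    println(overflowed)             // -9223372036854775719, close to Long.MinValue
    println(Long.MinValue)          // -9223372036854775808, the value in the logs above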