KafkaStreams EXACTLY_ONCE guarantee - skipped Kafka offsets

Date: 2018-01-19 15:08:10

Tags: apache-kafka spark-streaming offset apache-kafka-streams

I am using Spark 2.2.0 with the Kafka 0.10 spark-streaming library (spark-streaming-kafka-0-10) to read from a topic that is populated by a Kafka Streams Scala application. The Kafka broker version is 0.11 and the Kafka Streams version is 0.11.0.2.

When I set the EXACTLY_ONCE guarantee in the Kafka Streams application:

 p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)

I get this error in Spark:

java.lang.AssertionError: assertion failed: Got wrong record for spark-executor-<group.id> <topic> 0 even after seeking to offset 24
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:85)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.to(KafkaRDD.scala:189)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.toBuffer(KafkaRDD.scala:189)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.toArray(KafkaRDD.scala:189)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

If the EXACTLY_ONCE property is not set, everything works fine.
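
For context, a minimal sketch of the relevant Streams configuration; only the processing-guarantee line is the setting from the snippet above, while the application id and broker address are illustrative placeholders:

    import java.util.Properties
    import org.apache.kafka.streams.StreamsConfig

    val p = new Properties()
    // Placeholders, not taken from the original question:
    p.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")
    p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    // The setting in question; the default is AT_LEAST_ONCE, with which
    // the Spark consumer works without errors:
    p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)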

EDIT 1: The topic populated by the Kafka Streams application (with exactly-once enabled) has a wrong end offset. When I run kafka.tools.GetOffsetShell, it reports an end offset of 18, but there are only 12 messages in the topic (retention is disabled). With the exactly-once guarantee disabled, these offsets match. I tried to reset the Kafka Streams application according to this, but the problem persists.
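
The same discrepancy can be checked with a plain consumer. This is only a sketch, not code from the original post; broker address, group id, and topic/partition names are placeholders:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder
    props.put("group.id", "offset-check")             // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    val tp = new TopicPartition("output-topic", 0)    // placeholder
    consumer.assign(Collections.singletonList(tp))
    consumer.seekToBeginning(Collections.singletonList(tp))

    // Broker-reported end offset vs. records actually delivered:
    val endOffset: Long = consumer.endOffsets(Collections.singletonList(tp)).get(tp)
    var received = 0L
    while (consumer.position(tp) < endOffset) {
      received += consumer.poll(1000).count()
    }
    println(s"end offset = $endOffset, records received = $received")  // e.g. 18 vs 12
    consumer.close()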

EDIT 2: When I run SimpleConsumerShell with the --print-offsets option, the output is as follows:

next offset = 1
{"timestamp": 149583551238149, "data": {...}}
next offset = 2
{"timestamp": 149583551238149, "data": {...}}
next offset = 4
{"timestamp": 149583551238149, "data": {...}}
next offset = 5
{"timestamp": 149583551238149, "data": {...}}
next offset = 7
{"timestamp": 149583551238149, "data": {...}}
next offset = 8
{"timestamp": 149583551238149, "data": {...}}
...

Some offsets are apparently being skipped when the exactly-once delivery guarantee is enabled.

Any ideas? What is causing this? Thanks!

1 Answer:

Answer 0 (score: 2)

I found out that the offset gaps are expected behavior in Kafka (version >= 0.11): they are caused by commit/abort transaction markers. That would also explain EDIT 1: an end offset of 18 with only 12 messages is consistent with six transaction markers sitting in the log.

More information about Kafka transactions and control messages here:

"These transaction markers are not exposed to applications, but are used by consumers in read_committed mode to filter out messages from aborted transactions and to not return messages which are part of open transactions (i.e., those which are in the log but don't have a transaction marker associated with them)."

(quoted from here)
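
To make the arithmetic concrete, here is a hedged sketch (placeholder names, a standalone producer rather than the original Streams app) of how markers consume offsets: each committed transaction appends one commit marker per partition, and that marker occupies an offset the consumer never sees:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder
    props.put("transactional.id", "marker-demo")      // placeholder
    props.put("enable.idempotence", "true")           // required for transactions
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    for (i <- 1 to 2) {
      producer.beginTransaction()
      // On a single-partition topic the records land at offsets 0 and 2 ...
      producer.send(new ProducerRecord[String, String]("demo-topic", s"msg-$i"))
      // ... and each commit writes a marker that takes offsets 1 and 3.
      producer.commitTransaction()
    }
    producer.close()
    // GetOffsetShell now reports an end offset of 4, but a consumer receives
    // only 2 records: the gaps are the invisible commit markers.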

Kafka transactions were introduced in Kafka 0.11, so I assume that the spark-streaming-kafka 0.10 library is not compatible with this new message format yet, and that a newer version of spark-streaming-kafka has not been implemented yet.
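
For comparison, a sketch of what the consuming side might look like once the integration runs against a Kafka client >= 0.11 (later Spark releases upgraded the embedded client; the 0.10.0.x client bundled with Spark 2.2 does not understand this setting). isolation.level is a standard consumer config passed through kafkaParams; broker address, group id, and topic are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val ssc = new StreamingContext(new SparkConf().setAppName("reader"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-consumer",                     // placeholder
      // Skip aborted records and transaction markers on the broker side:
      "isolation.level" -> "read_committed"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("output-topic"), kafkaParams)  // placeholder topic
    )
    stream.map(_.value).print()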