I have a Spark Streaming application that consumes from Kafka:
KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set(kafkaTopic), kafkaParams)
)
The Kafka parameters are the defaults from https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html.
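Roughly, kafkaParams looks like this (a sketch following that guide; kafka:9092 and counselor-01 are the broker address and consumer group that appear further down in this question):

import org.apache.kafka.common.serialization.StringDeserializer

// Sketch of the consumer parameters, following the 0-10 integration guide;
// kafka:9092 and counselor-01 are the broker and consumer group used below.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "counselor-01",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)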
Everything works fine as long as the Kafka topic has a single partition. However, when there are more (2), Spark seems to read from only one of them. This is what I see in the logs:
17/07/28 12:08:15 INFO kafka010.KafkaRDD: Computing topic processedJobs, partition 0 offsets 20 -> 29
17/07/28 12:08:15 INFO kafka010.KafkaRDD: Beginning offset 0 is the same as ending offset skipping processedJobs 1
17/07/28 12:08:20 INFO kafka010.KafkaRDD: Beginning offset 29 is the same as ending offset skipping processedJobs 0
17/07/28 12:08:20 INFO kafka010.KafkaRDD: Beginning offset 0 is the same as ending offset skipping processedJobs 1
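One way to see which partitions each batch actually covers is to print the per-partition offset ranges inside foreachRDD (a purely diagnostic sketch, separate from the application logic):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Diagnostic only: print the offset range of every topic partition per batch,
// to check whether partition 1 ever receives a non-empty range.
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} from=${r.fromOffset} until=${r.untilOffset}")
  }
}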
kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER --topic processedJobs --group counselor-01
produces the following output:
Group         Topic          Pid  Offset  logSize  Lag  Owner
counselor-01  processedJobs  0    29      29       0    none
counselor-01  processedJobs  1    0       28       28   none
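The lag can also be cross-checked outside of Spark with a plain Kafka consumer (a sketch; broker, group id and topic are the same values as above):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Compare the committed offset against the log end offset for each partition.
val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")
props.put("group.id", "counselor-01")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("processedJobs").asScala
  .map(p => new TopicPartition(p.topic, p.partition))

consumer.assign(partitions.asJava)
consumer.seekToEnd(partitions.asJava)  // position() below then returns the log end offset
partitions.foreach { tp =>
  val end = consumer.position(tp)
  val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
  println(s"partition=${tp.partition} committed=$committed end=$end lag=${end - committed}")
}
consumer.close()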
I suspected this was related to incorrect offset committing, since I use this pattern:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
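This follows the commit pattern from the integration guide; a slightly fuller sketch of how I understand it, with a placeholder output action and the commit issued only after the batch output has run:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Grab the offset ranges first, before any transformation breaks the
  // one-to-one mapping between Kafka partitions and RDD partitions.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Placeholder for the real output of the batch.
  rdd.foreach(record => println(record.value))

  // Commit asynchronously only after the output above has completed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}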
So I changed enable.auto.commit to true and removed the commit, but the problem persists: the lag for one of the partitions keeps growing.
I am running Spark 2.1.0 and Kafka 0.10.1.0 in a Docker environment. Any hints on what the problem might be?
I have already looked at this question: Spark Structured Stream get messages from only one partition of Kafka, but with Kafka 0.10.0.1 I get
WARN clients.NetworkClient: Bootstrap broker kafka:9092 disconnected
errors from Spark.