Spark only consumes records from one partition of a Kafka topic

Date: 2017-07-28 14:46:30

Tags: scala apache-spark apache-kafka spark-streaming

I have a Spark Streaming application that consumes from Kafka:

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set(kafkaTopic), kafkaParams)
)

The Kafka parameters are the defaults from https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
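For context, those defaults are essentially the example map from that guide; a minimal sketch, with the broker address and consumer group taken from the log and the offset-checker command shown further down:

import org.apache.kafka.common.serialization.StringDeserializer

// Roughly the example parameters from the linked integration guide;
// "kafka:9092" and "counselor-01" are the values that appear later in this question.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "counselor-01",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)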

As long as the Kafka topic has a single partition, everything works fine. However, when it has more than one (2 in my case), Spark seems to read from only one of them. This is what I see in the logs:

17/07/28 12:08:15 INFO kafka010.KafkaRDD: Computing topic processedJobs, partition 0 offsets 20 -> 29
17/07/28 12:08:15 INFO kafka010.KafkaRDD: Beginning offset 0 is the same as ending offset skipping processedJobs 1
17/07/28 12:08:20 INFO kafka010.KafkaRDD: Beginning offset 29 is the same as ending offset skipping processedJobs 0
17/07/28 12:08:20 INFO kafka010.KafkaRDD: Beginning offset 0 is the same as ending offset skipping processedJobs 1

The output of kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER --topic processedJobs --group counselor-01 is:

Group           Topic                          Pid Offset          logSize         Lag             Owner
counselor-01    processedJobs                  0   29              29              0               none
counselor-01    processedJobs                  1   0               28              28              none
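As a cross-check, the per-partition end offsets and the group's committed offsets can also be read directly from the broker; a minimal sketch with the Kafka 0.10.1 consumer API, reusing the broker address and group id from above (assumptions, not necessarily my exact setup):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")   // assumed broker address
props.put("group.id", "counselor-01")          // group from the offset checker above
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
try {
  val partitions = consumer.partitionsFor("processedJobs").asScala
    .map(p => new TopicPartition(p.topic, p.partition))
  // Log-end offset per partition vs. the group's committed offset
  val ends = consumer.endOffsets(partitions.asJava).asScala
  partitions.foreach { tp =>
    val end = ends(tp).longValue
    val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
    println(s"$tp committed=$committed end=$end lag=${end - committed}")
  }
} finally {
  consumer.close()
}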

I thought this might be related to incorrect offset committing, which I do using this pattern:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
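To see whether partition 1 ever shows up in a batch at all, the same loop can log every offset range before committing; a minimal, self-contained sketch of that variant:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // One line per topic-partition, so a partition that never appears is easy to spot
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}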

I then changed enable.auto.commit to true and removed the manual commit, but the problem persists (the lag on one of the partitions keeps growing).
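In other words, that variant looks roughly like this (assuming the kafkaParams map sketched earlier; only the auto-commit flag changes and the commitAsync call is dropped):

// Assumption: kafkaParams is the map sketched earlier in this question.
val autoCommitParams: Map[String, Object] =
  kafkaParams + ("enable.auto.commit" -> (true: java.lang.Boolean))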

I am running Spark 2.1.0 and Kafka 0.10.1.0 in a Docker environment. Any hints as to what the problem might be?
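For reference, the build dependency that provides KafkaUtils and the 0-10 consumer strategies for this Spark version would be roughly the following (assuming an sbt build with Scala 2.11):

// build.sbt -- assumption: sbt, Scala 2.11, Spark 2.1.0
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
)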

I have already seen this question: Spark Structured Stream get messages from only one partition of Kafka, but with Kafka 0.10.0.1 I get:

WARN clients.NetworkClient: Bootstrap broker kafka:9092 disconnected

errors from Spark.

0 Answers:

There are no answers yet.