Understanding Kafka consumer commit failure after a leader change

Time: 2018-09-12 17:25:50

Tags: java apache-kafka

Consider the following real, obfuscated log:

 19:33:48,409 99733391 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: This is not the correct coordinator.
 19:33:48,410 99733392 (pool-6-thread-11) INFO  [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Group coordinator kafka1.maria4.internal:9092 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
 19:33:48,414 99733396 (kafka-producer-network-thread | producer-1) WARN  [org.apache.kafka.clients.producer.internals.Sender] [] [Producer clientId=producer-1] Got error produce response with correlation id 16386 on topic-partition service_megaman_mo-mcdonalnds_service_msg-1, retrying (99 attempts left). Error: NOT_LEADER_FOR_PARTITION
 19:33:48,510 99733492 (pool-6-thread-11) INFO  [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Discovered group coordinator kafka3.maria4.internal:9092 (id: 2147483644 rack: null)
 19:33:48,528 99733510 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: The coordinator is not aware of this member.
 19:33:48,528 99733510 (pool-6-thread-11) ERROR [com.bob.kafka.consumer.ListenableKafkaConsumer] [] Aborting consumer [mcdonalnds_service_msg] for topics [[service_megaman_mt-mcdonalnds_service_msg]] operation due to failure! Cause: 
 org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

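As far as I understand, the exception message about poll() does not describe the real cause. What actually happened:

  1. The coordinator became unavailable.

  2. The consumer discovered a new coordinator.

  3. The new coordinator did not recognize the offset, so it rejected the commit.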

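What I am trying to figure out is what options exist for recovering from this situation. This is not a flaky issue; it happens about once a year, so tuning the poll settings will not help when the leader dies.

What happens now: the original application code simply closes the consumer, which is wrong. It triggers alerts, and because the application stops consuming messages, pretty much everybody gets woken up :-)

What I want to happen: the consumer recovers and does not die when it loses its connection to the coordinator.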
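The desired behaviour above can be sketched as a poll loop that treats a rejected commit as recoverable instead of closing the consumer. This is only a minimal sketch under stated assumptions: `MessageSource`, `CommitRejectedException` and `ResilientLoop` are hypothetical names introduced here so the example compiles without kafka-clients on the classpath. In the real service they would correspond to `KafkaConsumer.poll(...)`, `KafkaConsumer.commitSync()` and `CommitFailedException`. As far as I know, after such a failure the next `poll()` call rejoins the group, but that should be verified against the client version in use.

```java
import java.util.List;

// Hypothetical stand-in for the parts of KafkaConsumer that the loop needs.
interface MessageSource {
    List<String> poll();   // stands in for KafkaConsumer.poll(...)
    void commit();         // stands in for KafkaConsumer.commitSync()
}

// Hypothetical stand-in for
// org.apache.kafka.clients.consumer.CommitFailedException.
class CommitRejectedException extends RuntimeException {
    CommitRejectedException(String message) {
        super(message);
    }
}

class ResilientLoop {
    private int rejectedCommits = 0;
    private int processed = 0;

    // Runs a fixed number of poll cycles. A rejected commit is logged and
    // counted, and the loop moves on to the next poll instead of aborting.
    void run(MessageSource source, int iterations) {
        for (int i = 0; i < iterations; i++) {
            List<String> batch = source.poll();
            for (String message : batch) {
                processed++;   // placeholder for the real message handling
            }
            if (batch.isEmpty()) {
                continue;      // nothing to commit
            }
            try {
                source.commit();
            } catch (CommitRejectedException e) {
                // Do not close the consumer here. The uncommitted batch may
                // be delivered again, so handling should tolerate duplicates.
                rejectedCommits++;
                System.err.println("Commit rejected, continuing: " + e.getMessage());
            }
        }
    }

    int rejectedCommits() {
        return rejectedCommits;
    }

    int processed() {
        return processed;
    }
}
```

The design choice is that only the commit is wrapped in the try block, so a failure in message handling still surfaces normally while a coordinator change costs at most one redelivered batch.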

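What I am not sure about:

  1. Why the coordinator is not aware of this member.

  2. Whether I understand the problem correctly. :-)

  3. Which methods of the Java Kafka library's KafkaConsumer class I should call on the service side (close, or unsubscribe followed by subscribe) to complete my consumer recovery scenario.

  4. What happens to the processed offset that the new coordinator rejected? Since the offset was not committed, I assume the consumer will re-read the same messages?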
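If the assumption in point 4 holds and the uncommitted messages are read again, one way to cope is to make the handler skip records it has already processed. The following is a minimal sketch, not a recommendation from the Kafka documentation: `ProcessedOffsets` is a hypothetical helper, and it keeps state in memory only, so it helps only when the same process gets the partition back. A restart, or reassignment to another member, would need a durable store instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the highest offset handled per partition so
// that a redelivered record can be recognised and skipped.
class ProcessedOffsets {
    private final Map<String, Long> lastProcessed = new HashMap<>();

    // Returns true if the record is new and should be processed, false if an
    // equal or higher offset was already handled for this partition.
    boolean markIfNew(String topicPartition, long offset) {
        Long last = lastProcessed.get(topicPartition);
        if (last != null && offset <= last) {
            return false;
        }
        lastProcessed.put(topicPartition, offset);
        return true;
    }
}
```

In the poll loop, each record would be handled only when `markIfNew(partitionName, offset)` returns true, with the partition name and offset taken from the record itself.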

There is a post that looks very similar but uses Spring-kafka; this service does not use Spring, so it is of limited use to me.

0 answers:

No answers