Kafka partition leader election fails after an uncontrolled broker shutdown

Time: 2017-10-11 10:52:42

Tags: apache-spark apache-kafka apache-zookeeper cloudera-cdh flume-ng

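We have 3 Kafka brokers and a topic with 40 partitions, with the replication factor set to 1. After an uncontrolled shutdown of a Kafka broker, we see that a new leader cannot be elected for some partitions (see the logs below). As a result, we can no longer read from the topic. Please advise whether it is possible to survive this kind of crash without changing the replication factor to a value greater than 1.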

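We want the target database (which is built from the events in the Kafka topic) to stay in a consistent state, so we have also set the parameter unclean.leader.election.enable to false.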

Partition information after the crash:

extenr-topic:1:882091242
extenr-topic:19:882091615
extenr-topic:28:882092273
Error: partition 18 does not have a leader. Skip getting offsets
Error: partition 27 does not have a leader. Skip getting offsets
Error: partition 36 does not have a leader. Skip getting offsets

Exception from the Kafka broker:

2017-10-09 05:56:50,302 ERROR state.change.logger: Controller 236 epoch 267 initiated state change for partition [extenr-topic,15] from OfflinePartition to OnlinePartition failed
kafka.common.NoReplicaOnlineException: No broker in ISR for partition [extenr-topic,15] is alive. Live brokers are: [Set(236, 237)], ISR brokers are: [235]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:66)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:342)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:203)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:118)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:115)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

The following error also appears in the logs:

2017-10-09 04:11:25,509 ERROR state.change.logger: Broker 235 received LeaderAndIsrRequest with correlation id 1 from controller 236 epoch 267 for partition [extenr-topic,36] but cannot become follower since the new leader -1 is unavailable.

1 answer:

Answer 0: (score: 1)

A partition whose replication.factor is 1 goes offline when its leader crashes or is shut down, because no other replica is available to take over.
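To make the failure in the question concrete, here is a minimal Python sketch of the selection rule. It is a simplified model, not Kafka's actual code: the function `select_leader` and the exception `NoReplicaOnlineError` are hypothetical names, and the broker IDs are taken from the log in the question. The sketch assumes that a leader is picked from live replicas in the ISR, and that unclean election can only fall back to other live replicas assigned to the same partition.

```python
class NoReplicaOnlineError(Exception):
    """Hypothetical stand-in for kafka.common.NoReplicaOnlineException."""


def select_leader(assigned_replicas, isr, live_brokers, unclean_election=False):
    """Pick a leader for an offline partition (simplified model)."""
    live = set(live_brokers)
    in_sync = set(isr)

    # Clean election: the first assigned replica that is both in sync and alive.
    for broker in assigned_replicas:
        if broker in in_sync and broker in live:
            return broker

    # Unclean election: any live assigned replica, even if it is out of sync.
    if unclean_election:
        for broker in assigned_replicas:
            if broker in live:
                return broker

    raise NoReplicaOnlineError(
        f"No eligible replica is alive. Live brokers: {sorted(live)}, "
        f"ISR brokers: {sorted(in_sync)}"
    )


# Replication factor 1, as in the question: broker 235 is down.
try:
    select_leader([235], isr=[235], live_brokers=[236, 237])
except NoReplicaOnlineError as err:
    print("offline:", err)

# Replication factor 3 (assumed layout): a surviving in-sync replica takes over.
print(select_leader([235, 236, 237], isr=[235, 236, 237], live_brokers=[236, 237]))
```

Under this model, enabling unclean.leader.election.enable would not help with a replication factor of 1 either, since there is no second replica to fall back to. The partition comes back only when broker 235 does.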

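If availability matters to you, I recommend increasing the replication factor. The recommended configuration for high availability [1] is replication.factor set to 3 and min.insync.replicas set to 2.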
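One way to raise the replication factor of an existing topic is a partition reassignment. The Python sketch below builds a reassignment plan in the JSON format read by the kafka-reassign-partitions tool. The helper name `build_reassignment` and the round-robin placement are assumptions made for illustration; the topic name, partition count, and broker IDs are taken from the question.

```python
import json


def build_reassignment(topic, num_partitions, brokers, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")

    partitions = []
    for partition in range(num_partitions):
        # Rotate the starting broker so the first (preferred) replica is spread evenly.
        replicas = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("extenr-topic", 40, [235, 236, 237], 3)
plan_json = json.dumps(plan)
print(json.dumps(plan["partitions"][:3], indent=2))
```

The full plan in `plan_json` can be saved to a file and passed to kafka-reassign-partitions with --reassignment-json-file and --execute. The connection options differ between Kafka versions, so check the documentation for the version you run. min.insync.replicas is set separately, as a broker or topic configuration.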

[1]: http://kafka.apache.org/documentation/#brokerconfigs