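I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). Each of my topics is configured with 20 partitions and a ReplicationFactor of 3.
Today I noticed that some partitions refuse to sync to all three nodes. Here's an example: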
bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline PartitionCount:20 ReplicationFactor:3 Configs:
Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 3 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 4 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 5 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 6 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 7 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 8 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 9 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 10 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 11 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 12 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 13 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 14 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 15 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 16 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 17 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 18 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 19 Leader: 2 Replicas: 2,0,1 Isr: 2
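For reference, a minimal Python sketch that picks out the partitions whose ISR is smaller than their replica list. The function name `under_replicated` is made up for illustration, and the parser assumes only the line layout shown in the output above; it is not part of any Kafka tooling.

```python
import re

# Matches one partition line of `kafka-topics.sh --describe` output as shown above.
# The summary line (PartitionCount:...) does not match and is skipped.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (partition, missing_broker_ids) for every partition whose
    ISR does not contain all of its assigned replicas."""
    result = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        replicas = {int(b) for b in match.group("replicas").split(",") if b}
        isr = {int(b) for b in match.group("isr").split(",") if b}
        missing = sorted(replicas - isr)
        if missing:
            result.append((int(match.group("partition")), missing))
    return result


if __name__ == "__main__":
    sample = (
        "Topic:prod-decline PartitionCount:20 ReplicationFactor:3 Configs:\n"
        "Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2\n"
        "Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1\n"
    )
    for partition, missing in under_replicated(sample):
        print(f"partition {partition}: brokers {missing} not in ISR")
```

Fed the full output above, this would list every partition led by broker 2 as missing brokers 0 and 1.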
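Only node 2 has all the data in sync. I tried restarting brokers 0 and 1, but that didn't improve the situation; if anything, it made it worse. I'm tempted to restart node 2, but I assume that would cause downtime or a cluster failure, so I'd like to avoid it.
I don't see any obvious errors in the logs, so I'm having a hard time figuring out how to debug this. Any tips would be greatly appreciated.
Thanks!
EDIT: Some additional info... If I check the metrics on node 2 (the one with the complete data), it does realize that some partitions are not being replicated correctly.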
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;
Nodes 0 and 1 do not. They seem to think everything is fine:
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;
Is this the expected behavior?
Answer 0: (score: 1)
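Try increasing replica.lag.time.max.ms.
The explanation is as follows:
If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.
If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, the replica will not shuffle in and out of the ISR unless it consistently remains behind the leader for replica.lag.time.max.ms.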
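To make the rule concrete, here is a simplified Python model of it, not Kafka's actual code. It assumes the leader tracks, for each follower, the last time that follower was fully caught up, and drops any follower whose timestamp is older than replica.lag.time.max.ms. The function name, the dictionary layout, and the 10000 ms value are assumptions for illustration; check your own broker configuration for the real setting.

```python
# Assumed value for illustration only; read the real one from server.properties.
REPLICA_LAG_TIME_MAX_MS = 10_000


def in_sync_replicas(now_ms, last_caught_up_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids that stay in the ISR under the simplified rule.

    last_caught_up_ms maps broker id -> timestamp (ms) at which that replica
    was last fully caught up with the leader (hypothetical bookkeeping).
    """
    return sorted(
        broker
        for broker, caught_up_at in last_caught_up_ms.items()
        if now_ms - caught_up_at <= max_lag_ms
    )


if __name__ == "__main__":
    # Broker 0 last caught up 15 s ago, broker 1 5 s ago; broker 2 is the leader.
    state = {0: 5_000, 1: 15_000, 2: 20_000}
    print(in_sync_replicas(20_000, state))                     # broker 0 is dropped
    print(in_sync_replicas(20_000, state, max_lag_ms=30_000))  # all three stay
```

In this model, raising the threshold only gives slow followers more time before they are evicted; a follower that never catches up at all would still fall out of the ISR eventually.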