Kafka partitions out of sync on some nodes

Date: 2018-06-27 12:34:16

Tags: apache-kafka apache-zookeeper

I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). My topics are configured with 20 partitions each and a ReplicationFactor of 3.

Today I noticed that some partitions refuse to sync to all three nodes. Here's an example:

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 6    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 11    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 12    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 17    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 18    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2

Only node 2 has all the data in sync. I tried restarting brokers 0 and 1, but that didn't improve the situation; if anything, it made it worse. I'm tempted to restart node 2 as well, but I assume that would cause downtime or a cluster failure, so I'd like to avoid it.

I don't see any obvious errors in the logs, so I'm having a hard time figuring out how to debug the situation. Any hints would be greatly appreciated.

Thanks!

EDIT: Some additional info... If I check the metrics on node 2 (the one with the complete data), it is indeed aware that some partitions are not replicating correctly:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;

Nodes 0 and 1 are not. They seem to think everything is fine:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;

Is this the expected behavior?
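One way to cross-check the JMX numbers is to parse the `--describe` output shown above and list every partition whose ISR is smaller than its assigned replica set. A minimal Python sketch (the `under_replicated` helper is illustrative, not a Kafka API):

```python
import re

def under_replicated(describe_output):
    """Return (partition, replicas, isr) tuples for partitions whose ISR
    is smaller than the assigned replica set, parsed from the output of
    kafka-topics.sh --describe."""
    rows = []
    for line in describe_output.splitlines():
        m = re.search(
            r"Partition:\s*(\d+)\s+Leader:\s*\S+\s+"
            r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]+)",
            line,
        )
        if not m:
            continue
        partition = int(m.group(1))
        replicas = m.group(2).split(",")
        isr = m.group(3).split(",")
        if len(isr) < len(replicas):
            rows.append((partition, replicas, isr))
    return rows
```

Running this over the `--describe` output in the question would flag the 14 partitions whose ISR has shrunk to `2`, matching what the topics tool already shows.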

1 answer:

Answer 0 (score: 1)

Try increasing replica.lag.time.max.ms.

The explanation is as follows:

If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.

If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, then it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, unless the replica consistently remains behind the leader for replica.lag.time.max.ms, it will not shuffle in and out of the ISR.
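The rule described above can be sketched in a few lines (a simplified model for illustration, not Kafka's actual ReplicaManager code; the timestamp bookkeeping is hypothetical, and 10000 ms is the broker default for replica.lag.time.max.ms in 0.11):

```python
# Broker default in Kafka 0.11; the answer suggests raising this value.
REPLICA_LAG_TIME_MAX_MS = 10_000

def current_isr(leader, last_caught_up_ms, now_ms):
    """Simplified ISR membership: the leader is always in the ISR, and a
    follower stays in only while it has fully caught up to the leader
    within the last replica.lag.time.max.ms milliseconds."""
    isr = {leader}
    for broker, caught_up_at in last_caught_up_ms.items():
        if broker != leader and now_ms - caught_up_at <= REPLICA_LAG_TIME_MAX_MS:
            isr.add(broker)
    return isr
```

In this model, a follower that last caught up 20 seconds ago drops out of the ISR, which is consistent with the pattern in the question: followers 0 and 1 keep falling behind broker 2, so the ISR shrinks to `2` on the partitions it leads. Raising the threshold gives slow followers more time before they are evicted.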