Question

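I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). My topics are each configured with 20 partitions and a ReplicationFactor of 3.

Today I noticed that some partitions refuse to sync to all three nodes. Here's an example: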

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 6    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 11    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 12    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 17    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 18    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2

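Only node 2 has all of the data in sync. I've tried restarting brokers 0 and 1, but that didn't improve the situation; if anything, it made things worse. I'm tempted to restart node 2, but I assume that would cause downtime or a cluster failure, so I'd like to avoid it if possible.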
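To make the pattern in the output above easier to see, here is a minimal Python sketch that parses `kafka-topics.sh --describe` text and groups the partitions whose Isr list is shorter than their Replicas list by leader. The function name `under_replicated` and the regular expression are illustrative assumptions based only on the output format shown in this post; they are not part of any Kafka tooling.

```python
import re
from collections import defaultdict

# Matches one partition line of the `--describe` output shown in this post.
# The header line ("PartitionCount:20 ...") does not match and is skipped.
LINE_RE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)

def under_replicated(describe_output):
    """Return {leader_id: [partition, ...]} for partitions whose ISR
    contains fewer brokers than the assigned replica list."""
    by_leader = defaultdict(list)
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # header or blank line
        partition, leader = int(m.group(1)), int(m.group(2))
        replicas = set(m.group(3).split(","))
        isr = set(filter(None, m.group(4).split(",")))
        if len(isr) < len(replicas):
            by_leader[leader].append(partition)
    return dict(by_leader)

# A few lines copied from the output above.
sample = """\
Topic:prod-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
"""

print(under_replicated(sample))  # {2: [0, 5]}
```

Run against the full listing, every partition with a shrunken ISR is led by broker 2, while the partitions led by brokers 0 and 1 are fully in sync.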

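I don't see any obvious errors in the logs, so I'm having a hard time figuring out how to debug this. Any tips would be greatly appreciated.

Thanks!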

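EDIT: Some additional information... If I check the metrics on node 2 (the one with the complete data), it does recognize that some partitions are not being replicated correctly.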

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;

Nodes 0 and 1 do not. They seem to think everything is fine:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;

Is this the expected behavior?

Answer 1

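Try increasing replica.lag.time.max.ms.

The explanation is as follows:

If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.

If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, the replica will not shuffle in and out of the ISR unless it consistently remains behind the leader for replica.lag.time.max.ms.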
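The rule described above can be sketched as a small Python model. This is a simplified illustration, not the broker's actual code: the function name `in_sync` and the timestamps are made up for the example, and the 10000 ms value is the default for replica.lag.time.max.ms in Kafka 0.11 as far as I know.

```python
def in_sync(now_ms, last_caught_up_ms, replica_lag_time_max_ms):
    """Model of the ISR rule: a follower stays in the ISR as long as it was
    fully caught up with the leader within the last replica_lag_time_max_ms."""
    return now_ms - last_caught_up_ms <= replica_lag_time_max_ms

now = 100_000  # hypothetical wall-clock time in milliseconds

# A follower that caught up 4 s ago is fine with the 10 s setting.
print(in_sync(now, now - 4_000, 10_000))   # True

# A follower that has been behind for 15 s is dropped from the ISR...
print(in_sync(now, now - 15_000, 10_000))  # False

# ...but would stay in the ISR if the limit were raised to 30 s, e.g. via
# replica.lag.time.max.ms=30000 in the broker's server.properties.
print(in_sync(now, now - 15_000, 30_000))  # True
```

Raising the limit only hides the symptom if the followers are genuinely unable to keep up, so it is worth checking why brokers 0 and 1 are falling behind in the first place.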

Kafka partitions are out of sync on some nodes