当我反弹经纪人时,Apache Kafka会失去一些消费者补偿

时间:2018-09-17 12:32:00

标签: apache-kafka

Kafka 1.1.1-cp1。 (编辑4:我最终向Kafka提交了一个有关此问题的错误-https://issues.apache.org/jira/browse/KAFKA-7447

我有3个经纪人,其中所有主题的min.insync.replicas = 2offsets.commit.required.acks = -1

如您所料,当我停止其中一位经纪人时,它会移交其负责人的分区,并且一切正常进行(消费,生产者生产)。

当我带回经纪人时,问题就开始了。似乎发生的是,这会导致集群混乱,并且某些__consumer_offset主题会立即被截断为0。

这里是按时间顺序从受影响的__consumer_offset分区(最初由崩溃的中间商引导的分区)中选择的日志。故事在所有三个经纪人的日志中播出。

本质上,我跳出的经纪人回来了,似乎不明白新领导者的意思,将其截断为0,然后说服其他副本也截断为0。

prod-kafka-2 :(刚刚启动)

[2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)

prod-kafka-3 :(请参见副本1回来)

[2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition)

prod-kafka-2:

[2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous Leader Epoch was: 77 (kafka.cluster.Partition)
[2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling loading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:24:03,288] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-29 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)

prod-kafka-3:努力同意prod-kafka-2。将其踢出ISR,然后与ZooKeeper战斗。也许2和3都认为自己是领导者?

[2018-09-17 09:24:15,372] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:15,377] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

prod-kafka-2:将另外两个副本同时从ISR列表中剔除掉,​​尽管2个是我们刚刚重启的副本,因此很可能落后于此。 (请记住,它已经决定将主题截断为0!)

[2018-09-17 09:24:16,481] INFO [Partition __consumer_offsets-29 broker=1] Shrinking ISR from 0,2,1 to 1 (kafka.cluster.Partition)
prod-kafka-3:仍在与Zookeeper战斗。最终输了。

[2018-09-17 09:24:20,374] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:20,378] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:25,347] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:25,350] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:30,359] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:30,362] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:35,365] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:35,368] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:40,352] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:40,354] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:45,422] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:45,425] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:50,345] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:50,348] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:55,444] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:55,449] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:00,340] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:00,343] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:05,374] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:05,377] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:10,342] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:10,344] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:15,348] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:15,351] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:20,338] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:20,340] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:25,338] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:25,340] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:30,382] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:30,387] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:35,341] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:35,344] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:40,460] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:40,465] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:45,335] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:45,338] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

prod-kafka-1:突然变得困惑,并重新初始化为0,因为prod-kafka-2显然成为了领导者。

[2018-09-17 09:25:48,807] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)

prod-kafka-3:最终确定prod-kafka-2负责,并相应地截断

[2018-09-17 09:25:48,806] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:48,807] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(1,prod-kafka-2.c.i-lastfm-prod.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:48,809] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:48,810] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:48,950] WARN [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:48,951] INFO [Log partition=__consumer_offsets-29, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)

prod-kafka-1:领导就职确认。

[2018-09-17 09:25:50,207] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)

prod-kafka-2:现在它已经通过动物园管理员维护了统治地位,prod-kafka-3已添加到ISR列表中

[2018-09-17 09:25:50,210] INFO [Partition __consumer_offsets-29 broker=1] Expanding ISR from 1 to 1,2 (kafka.cluster.Partition)

prod-kafka-1:仍在努力接受现实,但最终也被截断为0。

[2018-09-17 09:25:51,430] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:52,615] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:53,637] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:54,150] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:54,151] INFO [ReplicaFetcherManager on broker 0] Added fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(1,prod-kafka-2.c.i-lastfm-prod.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:54,151] INFO [GroupMetadataManager brokerId=0] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:54,153] INFO [GroupMetadataManager brokerId=0] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:54,261] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:54,261] INFO [Log partition=__consumer_offsets-29, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)

prod-kafka-2:完成其消费者补偿计划,现在全部为0。

[2018-09-17 09:25:56,244] INFO [Partition __consumer_offsets-29 broker=1] Expanding ISR from 1,2 to 1,2,0 (kafka.cluster.Partition)

编辑:

根据要求,这是kafka server.properties文件:

broker.id=1
default.replication.factor=3
auto.create.topics.enable=false
min.insync.replicas=2
num.network.threads=12
num.io.threads=16
num.replica.fetchers=6
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/var/lib/kafka/data
num.partitions=1
num.recovery.threads.per.data.dir=4
offsets.retention.minutes=10080
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=60000
zookeeper.connect=prod-kafka-1:2181,prod-kafka-2:2181,prod-kafka-3:2181
zookeeper.connection.timeout.ms=6000
confluent.support.metrics.enable=false
confluent.support.customer.id=anonymous
group.initial.rebalance.delay.ms=3000

这是zookeeper.properties文件:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=prod-kafka-1:2888:3888
server.2=prod-kafka-2:2888:3888
server.3=prod-kafka-3:2888:3888
autopurge.purgeInterval=12
autopurge.snapRetainCount=6

编辑2 升级到Kafka-2.0.0似乎无法解决问题。

可能是我的传入速率太高,并且当我知道崩溃的服务器即将恢复时,需要限制生产者吗?听起来对吗...?

编辑3 设置auto.leader.rebalance.enable=false解决了该问题,但是现在我必须手动重新平衡。但是,当所有分区都陷入困境时,手动重新平衡似乎没有任何问题。

1 个答案:

答案 0 :(得分:1)

自提出问题以来取得了一些进展。如评论中所述,创建了Kafka-7447来跟踪此问题。尽管该问题具体尚未解决,但一些人提到该问题与Kafka-8896有关,并且实际上已经解决了。 (而且那些遇到原始问题的人在使用包含此改进的版本后不再有问题。)

这样,该问题已得到解决,使用Kafka 2.2.2及更高版本(或带有该补丁的任何版本)应确保您不会遇到此问题。