Kafka stuck on reassign partitions tool and progress

Time: 2018-12-27 17:53:15

Tags: apache-kafka

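I am running the reassign partitions tool to expand the partitions over 5 brokers instead of 5. This is Kafka 2.1, on Docker.

It got to a point where one of the nodes started behaving erratically. The other (healthy) nodes began showing these messages: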

[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
    at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

15 minutes later, the healthy servers showed the following messages:

[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)

Later on, we see a lot of these messages:

[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)

Mixed in with them is another kind of message, which is more common:

[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)

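The topic being reassigned has 128 partitions across 3 servers. All in all, each server holds around 2000 partitions.

The reassignment has now been stuck for 6 hours, with 41% of partitions reported as under-replicated. The topic had replication factor 3, although it now shows 5. I suppose this is part of how the rebalancing works underneath: the new replicas are added first and the unwanted ones are dropped later?

Node 3, however, is showing these messages: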

[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)

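So something is wrong with node 3. How can I find out what happened to it?

This has happened on both occasions that we tried reassigning partitions, on two topics with the same number of partitions. In the previous case, we brought up another machine as a new broker with the same ID (restarting the container did not help) and it recovered. But how can we prevent this from happening?

What could be the root cause?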

1 Answer:

Answer 0: (score: 0)

It has been a while since I wrote this, but in case it helps anyone: I think the settings change that helped was increasing zookeeper.session.timeout.ms, zookeeper.connection.timeout.ms, and replica.lag.time.max.ms, in our case to 60000.
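
For illustration, the corresponding entries in a broker's server.properties might look roughly like this. It is a sketch that assumes all three settings were raised to the same value of 60000 ms and that the brokers are configured through that file rather than, say, Docker environment variables; adjust to your own setup.

```properties
# Assumed values (milliseconds), based on the 60000 mentioned above
zookeeper.session.timeout.ms=60000
zookeeper.connection.timeout.ms=60000
replica.lag.time.max.ms=60000
```

These are broker-level settings, so each broker would need the change and a restart for it to take effect.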

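Since then it has not happened again. The idea behind it is that at some point one of the brokers lost its ZK session, which created a mismatch between the brokers, which thought that broker was still alive, and ZK, which thought it was not. For some reason this never got cleaned up. Increasing these settings lets sessions survive longer interruptions. Beware: it will also take longer to expire brokers that are genuinely dead.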