Ran the reassign partitions tool to expand the partitions over 5 brokers instead of the original 3. Kafka 2.1 on Docker.
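For context, this is roughly how such a reassignment is kicked off on Kafka 2.1 (a sketch only, not our exact invocation; the ZooKeeper address, file names and broker list below are placeholders):

# propose an assignment spreading the topics over all five brokers
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4,5" --generate    # placeholder: the IDs of all five brokers

# topics.json (topic names as they appear in the logs below)
{"version":1,"topics":[{"topic":"impressions"},{"topic":"events"}]}

# save the proposed assignment as reassignment.json and apply it
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --execute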
It got to a point where one of the nodes started behaving erratically. The other (healthy) nodes started showing the following messages:
[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
15 minutes later, the healthy servers show the following messages:
[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)
Later on we see a lot of these messages:
[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)
And, among other sets of messages like these, this one is more common:
[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)
The topics being reassigned have 128 partitions over 3 of the servers. All in all, each server has around 2000 partitions.
The reassignment has now been stuck for 6 hours, showing 41% of the partitions as under-replicated. The topics originally had replication factor 3, although right now they show replication factor 5. I suppose that is part of how the rebalancing works under the hood, adding the extra replicas first and dropping the unneeded ones later?
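For reference, checking the state of the reassignment and the under-replicated partitions looks roughly like this (zk:2181 and reassignment.json are placeholders for our ZooKeeper address and the JSON file passed to --execute):

# report which partitions of the reassignment have completed
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --verify

# list every partition whose ISR is smaller than its replica set
kafka-topics.sh --zookeeper zk:2181 --describe --under-replicated-partitions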
Node 3, however, shows these messages:
[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)
So something is wrong with node "3" - how can I find out what is happening to it?
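One thing that seems worth checking (a sketch; zk:2181 and the container name are placeholders) is whether ZooKeeper still believes broker 3 is registered, and what broker 3 itself logs about its ZooKeeper session around the time of the disconnects:

# is broker 3 still registered in ZooKeeper?
zookeeper-shell.sh zk:2181 ls /brokers/ids
zookeeper-shell.sh zk:2181 get /brokers/ids/3

# what does the broker itself say about its ZooKeeper session?
docker logs kafka-broker-3 2>&1 | grep -iE "zookeeper|session|expired"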
This has happened both of the two times we have tried to reassign partitions, each time across two topics with the same number of partitions. In the previous case we brought up another machine as a new broker with the same ID (restarting the container did not help), and it recovered. But how can this be avoided?
What could be the root cause?
Answer 0: (score: 0)
It has been a while since this was written. But in case it helps anyone, what I think helped our setup was increasing zookeeper.session.timeout.ms, zookeeper.connection.timeout.ms and replica.lag.time.max.ms, in our case to 60000.
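In broker configuration terms that corresponds to something like the following server.properties snippet (the 60000 values are the ones reported above; the Kafka 2.1 defaults for these settings are on the order of 6-10 seconds):

# ZooKeeper session/connection timeouts and the follower lag threshold,
# all raised to 60 seconds
zookeeper.session.timeout.ms=60000
zookeeper.connection.timeout.ms=60000
replica.lag.time.max.ms=60000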
It has not happened again since. The idea behind it is that at some point one of the brokers lost its ZK session, creating a mismatch between the brokers, which still believed that broker was alive, and ZK, which did not. For some reason this was never cleaned up. Increasing those settings allows sessions to stick around for longer. Beware that it will also take longer to expire brokers that are genuinely dead.