I have a Kafka consumer. It seems to work for a while and then dies, and it does this repeatedly. I get this exception but no other information:
org.apache.kafka.common.errors.TimeoutException:
Failed to get offsets by times in 305000 ms
305000 ms is just over 5 minutes. Any clues as to what might cause this, or steps I could take to find out?
In case it's relevant:
I have 3 processes on different machines using the latest Java Kafka client, version 0.10.2.0. Each machine runs 20 threads, and each thread has a separate Consumer. By design, when one thread dies, all the threads are killed, the process exits, and it is restarted. That means about 20 consumers die and restart at roughly the same time, which triggers a rebalance, so the clients may be periodically interfering with each other. But that does not explain why I get this exception in the first place.
I have three Kafka machines and three Zookeeper machines. Every client has bootstrap.servers configured with all 3 Kafka machines. The topic has 200 partitions, which means each thread is assigned roughly 3 partitions. The topic has a replication factor of 2.
There are no errors in the Kafka or Zookeeper logs.
The following configuration values are set, and no others.
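For illustration only, here is a minimal sketch of how one of these consumer threads might be constructed. The host names, group id, and deserializers are placeholders of mine, not the actual configuration from this setup, and the one timeout shown is the 0.10.x consumer's default request.timeout.ms of 305000 ms, which appears to match the timeout in the exception message.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Minimal sketch of one consumer thread's setup; host names, group id and the
// explicit timeout below are illustrative placeholders, not the original config.
public class ConsumerSketch {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka1:9092,kafka2:9092,kafka3:9092"); // all three brokers
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // request.timeout.ms defaults to 305000 ms in the 0.10.x consumer, the
        // same 305000 ms that shows up in the TimeoutException above.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "305000");
        return new KafkaConsumer<>(props);
    }
}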
Answer (score: 1):
I ran into this today. I saw two different versions of this error message depending on whether I was using the Kafka 1.0 or the Kafka 2.0 client libraries. With the 1.0 client the error message was "org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 305000 ms", and with the 2.0 client libraries it was "org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30003ms".
I got this message when I tried to monitor offsets/lag with the Kafka command-line consumer tools (e.g. kafka-consumer-groups --bootstrap-server {servers} --group {group} --describe). These commands are part of the Kafka/Confluent tooling, but I imagine the same thing can happen with other clients.
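As a rough illustration of how the same failure can surface from the Java client rather than the CLI (this is my sketch, not something from the original answer, and the topic name and timestamp are placeholders): the "get offsets by times" in the message corresponds to the consumer's offset-lookup calls, which block for up to request.timeout.ms and then throw this TimeoutException when no leader is available to answer the list-offsets request.

import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

// Sketch of the consumer-side offset lookups that can produce
// "Failed to get offsets by times". Topic name and timestamp are placeholders.
public class OffsetLookupSketch {
    public static void lookup(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("myTopic", 0);
        long anHourAgo = Instant.now().minusSeconds(3600).toEpochMilli();
        // Look up the earliest offset at or after the given timestamp.
        Map<TopicPartition, OffsetAndTimestamp> byTime =
                consumer.offsetsForTimes(Collections.singletonMap(tp, anHourAgo));
        // endOffsets() uses the same list-offsets request; tools that compute
        // consumer lag rely on this information.
        Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singleton(tp));
        System.out.println("byTime=" + byTime + ", end=" + end);
    }
}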
The problem appeared to be that I had a topic with a replication factor of 1 where one of the partitions had no assigned leader. The only way I found to track this down was to update the {kafka_client_dir}\libexec\config\tools-log4j.properties file to log at DEBUG level: log4j.rootLogger=DEBUG, stderr
Note that this is the log4j configuration file for the Kafka/Confluent command-line tools, so YMMV with other clients. I was running them on a Mac.
Once I had done that, I saw the following message in the output, which alerted me to the ISR/offlineReplicas problem:
[2019-01-28 11:41:54,290] DEBUG Updated cluster metadata version 2 to Cluster(id = 0B1zi_bbQVyrfKwqiDa2kw,
nodes = [
brokerServer3:9092 (id: 3 rack: null),
brokerServer6:9092 (id: 6 rack: null),
brokerServer1:9092 (id: 1 rack: null),
brokerServer5:9092 (id: 5 rack: null),
brokerServer4:9092 (id: 4 rack: null)], partitions = [
Partition(topic = myTopicWithReplicatinFactorOne, partition = 10, leader = 6, replicas = [6], isr = [6], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 11, leader = 1, replicas = [1], isr = [1], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 12, leader = none, replicas = [2], isr = [], offlineReplicas = [2]),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 13, leader = 3, replicas = [3], isr = [3], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 14, leader = 4, replicas = [4], isr = [4], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 2, leader = 4, replicas = [4], isr = [4], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 3, leader = 5, replicas = [5], isr = [5], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 4, leader = 6, replicas = [6], isr = [6], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 5, leader = 1, replicas = [1], isr = [1], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 6, leader = none, replicas = [2], isr = [], offlineReplicas = [2]),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 7, leader = 3, replicas = [3], isr = [3], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 8, leader = 4, replicas = [4], isr = [4], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 9, leader = 5, replicas = [5], isr = [5], offlineReplicas = []),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 0, leader = none, replicas = [2], isr = [], offlineReplicas = [2]),
Partition(topic = myTopicWithReplicatinFactorOne, partition = 1, leader = 3, replicas = [3], isr = [3], offlineReplicas = [])
], controller = brokerServer4:9092 (id: 4 rack: null)) (org.apache.kafka.clients.Metadata)
You can see above where it says offlineReplicas = [2], which hints at the problem. Also, brokerServer2 is not in the list of brokers.
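As an alternative to turning on DEBUG logging, the same condition can be detected programmatically. Here is a sketch of mine (not from the original answer, with a placeholder bootstrap server) that describes the topic with the AdminClient and flags partitions that have no usable leader; from the shell, kafka-topics --describe --unavailable-partitions gives a similar view.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

// Sketch: flag partitions with no leader, the condition that appears as
// "leader = none" in the DEBUG metadata above. Bootstrap server is a placeholder.
public class LeaderCheckSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "brokerServer1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            String topic = "myTopicWithReplicatinFactorOne";
            TopicDescription desc =
                    admin.describeTopics(Collections.singletonList(topic)).all().get().get(topic);
            for (TopicPartitionInfo p : desc.partitions()) {
                Node leader = p.leader();
                if (leader == null || leader.isEmpty()) {
                    System.out.println("Partition " + p.partition() + " has no leader;"
                            + " replicas=" + p.replicas() + " isr=" + p.isr());
                }
            }
        }
    }
}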
In the end I restarted the affected broker (brokerServer2) to bring it back into sync, and after doing that there were no more problems using the command-line tools. There may be a better way to resolve this than restarting the broker, but that is what finally fixed it.