We have 3 Kafka (1.0.0) nodes and a topic with 4 partitions and 3 replicas. The topic normally looks like this:
Topic:MissionControlTopic PartitionCount:4 ReplicationFactor:3 Configs:
Topic: MissionControlTopic Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 2,1,0
Topic: MissionControlTopic Partition: 1 Leader: 1 Replicas: 1,2,0 Isr: 2,1,0
Topic: MissionControlTopic Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 2,1,0
Topic: MissionControlTopic Partition: 3 Leader: 0 Replicas: 0,2,1 Isr: 2,1,0
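For reference, the listing above is the output of the standard describe command (the ZooKeeper address here is a placeholder for ours; Kafka 1.0.0 still takes `--zookeeper` rather than `--bootstrap-server`):

```
bin/kafka-topics.sh --describe --zookeeper zk1.example.com:2181 --topic MissionControlTopic
```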
Every once in a while, node 0 stops responding (that is *a* problem, but not *the* problem here). When that happens, the other two nodes correctly take over its partitions, and the topic looks like this:
Topic:MissionControlTopic PartitionCount:4 ReplicationFactor:3 Configs:
Topic: MissionControlTopic Partition: 0 Leader: 1 Replicas: 0,1,2 Isr: 2,1
Topic: MissionControlTopic Partition: 1 Leader: 1 Replicas: 1,2,0 Isr: 2,1
Topic: MissionControlTopic Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 2,1
Topic: MissionControlTopic Partition: 3 Leader: 2 Replicas: 0,2,1 Isr: 2,1
At this point, most (but not all) of the producers and consumers can no longer write to or read from Kafka and keep logging LEADER_NOT_AVAILABLE exceptions (issue one). Once node 0 comes back and the leaders have rebalanced, the applications still log the exceptions (issue two). Only after the applications are restarted do they reconnect and start working normally. As you can imagine, restarting all of our applications every time a Kafka node has a problem is impractical.
I'm not sure what information would be useful for troubleshooting this. We have scoured the internet and found no indication that anything is obviously wrong with our configuration. I have even reproduced the problem locally, and there, once the node came back, the applications reconnected correctly.
Here is the code that writes to Kafka:
Properties properties = new Properties();
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaUrl);
properties.put(ProducerConfig.ACKS_CONFIG, "all");
properties.put(ProducerConfig.RETRIES_CONFIG, 0);
properties.put(ProducerConfig.LINGER_MS_CONFIG, 10);
properties.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000);
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getCanonicalName());
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, GenericEventSerializer.class.getCanonicalName());
kafkaProducer = new KafkaProducer<>(properties);
// And at some later point...
kafkaProducer.send(new ProducerRecord<>(TOPIC, event), (metadata, exception) -> {
if (exception != null)
{
LOGGER.error("Failed to write to Kafka", exception);
}
});
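With RETRIES_CONFIG set to 0, the client does not retry transient errors such as LEADER_NOT_AVAILABLE itself, so they surface straight to the callback above. A minimal application-side retry wrapper (a hypothetical helper with a plain-Java stand-in for the actual send, not our production code) might look like this:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Retry an operation a few times with a fixed backoff before giving up.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(backoffMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated send: fails twice (as if LEADER_NOT_AVAILABLE), then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("LEADER_NOT_AVAILABLE");
            return "ack";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ack after 3 attempts"
    }
}
```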
Here is the code that reads from it:
Properties props = new Properties();
props.put("enable.auto.commit", false);
props.put("bootstrap.servers", kafkaHostString);
props.put("group.id", consumerGroupId);
props.put("request.timeout.ms", 15000);
props.put("session.timeout.ms", 10000);
props.put("max.poll.records", 10000);
props.put("batch.size", 6400000);
Consumer<String, GenericEvent> consumer = new KafkaConsumer<>(props, new StringDeserializer(), new GenericEventDeserializer());
consumer.subscribe(Collections.singleton(topic));
// And at some later point ...
records = consumer.poll(pollTimeout);
consumer.commitSync();
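The poll/commit pair above runs inside a loop. A minimal sketch of such a loop, with plain-Java stand-ins for `consumer.poll(...)` and `consumer.commitSync()` (all names hypothetical), where a transient failure is logged and swallowed rather than allowed to kill the consumer thread:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class PollLoopSketch {
    // poll/commit are stand-ins for consumer.poll(...) and consumer.commitSync().
    static int consume(Supplier<List<String>> poll, Runnable commit, int iterations) {
        int processed = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                List<String> records = poll.get();
                processed += records.size();
                commit.run(); // commit only after the batch was handled
            } catch (RuntimeException e) {
                // log and keep going; the next poll() refreshes cluster metadata
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        int[] n = {0};
        // Simulated poll: the second call fails, the others return one record each.
        int processed = consume(() -> {
            if (++n[0] == 2) throw new IllegalStateException("LEADER_NOT_AVAILABLE");
            return Collections.singletonList("record");
        }, () -> {}, 3);
        System.out.println("processed=" + processed); // prints "processed=2"
    }
}
```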
advertised.host.name, advertised.port, and advertised.listeners are all set in server.properties.
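For example, the relevant server.properties entries look like this (host name and port are placeholders, not our real values):

```
# server.properties on node 0 -- placeholder host/port
advertised.listeners=PLAINTEXT://kafka0.example.com:9092
advertised.host.name=kafka0.example.com
advertised.port=9092
```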