Repeatedly receiving NotCoordinatorException on the Kafka Streams instance side

Time: 2018-08-18 13:20:21

Tags: apache-kafka apache-kafka-streams

We are running Kafka broker version 0.10.2.0 with Kafka Streams version 1.1.0.
On the Kafka broker machine that the consumer believes is the group coordinator, we see the following log line:

2018-08-18 11:54:12,693 [kafka-request-handler-5] TRACE (Logging.scala:36) - [KafkaApi-48476987] Sending join group response {error_code=16,generation_id=0,group_protocol=,leader_id=,member_id=,members=[]} for correlation id 538 to client chitraguptaV1-dcf33a6c-368e-472e-aee7-120f1216aa3f-StreamThread-2-consumer.

Since NotCoordinatorException is a retriable exception, the client keeps retrying, and the group coordinator keeps sending the same error code (error_code=16) back to the consumer client:

https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L380
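To illustrate the loop described above, here is a plain-Java sketch of the client-side behavior. This is not the real Kafka client API; the class and method names are invented stand-ins for the JoinGroup request and coordinator rediscovery that AbstractCoordinator performs, assuming (as in the question) that the broker always answers with error_code=16.

```java
public class NotCoordinatorRetrySim {
    static final short NOT_COORDINATOR = 16;

    // Stand-in for the JoinGroup round trip: the broker in the question
    // always answers with error_code=16, as shown in the TRACE log above.
    static short sendJoinGroup(String coordinator) {
        return NOT_COORDINATOR;
    }

    // Stand-in for FindCoordinator: rediscovery keeps resolving to the
    // same (wrong) broker, so the loop never makes progress.
    static String findCoordinator() {
        return "broker-48476987:9092";
    }

    public static void main(String[] args) {
        String coordinator = findCoordinator();
        int attempts = 0;
        while (attempts < 5) { // bounded only so this demo terminates
            attempts++;
            if (sendJoinGroup(coordinator) == NOT_COORDINATOR) {
                // Matches the client log: "Group coordinator ... is
                // unavailable or invalid, will attempt rediscovery"
                coordinator = findCoordinator();
            } else {
                break; // join succeeded; the real client proceeds to SyncGroup
            }
        }
        System.out.println("attempts=" + attempts);
    }
}
```

Because the error is retriable and rediscovery resolves to the same broker, the real client loops indefinitely rather than stopping after a fixed number of attempts.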

On the Kafka Streams instance side we see the following log lines:

Discovered group coordinator <groupcoordinator_host_name>:9092 (id: 2099006660 rack: null)
(Re-)joining group
Group coordinator <groupcoordinator_host_name>:9092 (id: 2099006660 rack: null) is unavailable or invalid, will attempt rediscovery
InitiateJoinGroup request failed This is not the correct coordinator.

Please help us resolve this issue.

Our Kafka cluster configuration:

delete.topic.enable: true
auto.create.topics.enable: true
unclean.leader.election.enable: false
controlled.shutdown.enable: true
controlled.shutdown.max.retries: 3
controlled.shutdown.retry.backoff.ms: 5000
default.replication.factor: 1
offsets.topic.num.partitions: 200
offsets.topic.replication.factor: 3
offsets.retention.check.interval.ms: 600000
offsets.commit.timeout.ms: 5000
num.network.threads: 3
num.replica.fetchers: 2
num.io.threads: 8
socket.send.buffer.bytes: 8388608
socket.receive.buffer.bytes: 8388608
socket.request.max.bytes: 314572800
log.retention.hours: 4
log.retention.bytes: 10737418240
log.segment.bytes: 536870912
log.cleanup.policy: delete
zookeeper.connection.timeout.ms: 6000
zookeeper.session.timeout.ms: 6000
zookeeper.sync.time.ms: 2000
queued.max.requests: 500
replica.lag.time.max: 10000
replica.fetch.wait.max.ms: 500
min.insync.replicas: 2
replica.fetch.max.bytes: 67108864
message.max.bytes: 67108864
replica.high.watermark.checkpoint.interval.ms: 5000
replica.socket.timeout.ms: 30000
replica.socket.receive.buffer.bytes: 65536

Our Kafka Streams configuration:

props.setProperty(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, "1");
props.setProperty(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, "1800000");
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class.getName());
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, "2");
props.put(StreamsConfig.TOPIC_PREFIX + TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE);
props.put(StreamsConfig.TOPIC_PREFIX + TopicConfig.RETENTION_MS_CONFIG, "43200000");
props.put(StreamsConfig.TOPIC_PREFIX + TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(StreamsConfig.CONSUMER_PREFIX + ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "240000");
props.put(StreamsConfig.REQUEST_TIMEOUT_MS_CONFIG, "300000");
props.put(StreamsConfig.RETRIES_CONFIG, "20");
props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, "2400");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "1048576");
props.put(ProducerConfig.LINGER_MS_CONFIG, "2400");
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "300000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
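For reference, the properties above are only tuning overrides; a Streams application also needs the mandatory application.id and bootstrap.servers settings, which the snippet omits. A minimal hedged assembly might look like the sketch below, where the application id "chitraguptaV1" is inferred from the client id in the TRACE log, and the broker addresses and topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsBootstrap {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Mandatory settings not shown in the question; the application id
        // is inferred from the client id in the TRACE log, and the broker
        // list is a placeholder.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "chitraguptaV1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        // ...the tuning properties listed above go here...

        // Placeholder topology; the real topology is not shown in the question.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The bootstrap.servers list matters here: the client's FindCoordinator request goes to one of these brokers, and the broker it names must agree that it owns the group's __consumer_offsets partition, otherwise JoinGroup fails with error_code=16 as in the logs above.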

0 answers