我有一个Kafka流应用程序,可以使用1个源主题和20个分区。流量负载约为2K记录/秒。我将应用程序部署到了63个实例,并且运行良好。但是我注意到,分区分配总是在变化。我检查了每个实例的KafkaStreams#localTheadMetadata
输出,响应始终是PARTITIONS_REVOKED
或PARTITIONS_ASSIGNED
,有时是RUNNING
。
从日志中,我看到了两个不同的错误:
Offset commit failed on partition production-smoke-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-13 at offset 25010: The coordinator is not aware of this member.
org.apache.kafka.streams.errors.StreamsException: task [2_13] Abort sending since an error caught with a previous record (key 264344038933 value [B@359a4692 timestamp 1563412782970) to topic production-smoke-KTABLE-SUPPRESS-STATE-STORE-0000000021-changelog due to org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and `retry.backoff.ms` to avoid this error.```
该应用程序仍在运行,并向下游主题发送消息。我的理解是,由于我的应用程序不需要63个节点,因此某些节点处于空闲状态,并且一旦一个节点由于上述错误而死亡,则会触发重新平衡。是正确的吗? (某些节点在调用KafkaStreams is not running. State is ERROR.
时返回KafkaStreams#localTheadMetadata
。一旦所有节点都失效,应用程序将完全失效吗?
任何人都可以帮助您了解解决上述错误的正确方法吗?增加retries
和retry.backoff.ms
会掩盖一些更大的问题吗?
这是我的配置: 生产者配置:
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
commit.interval.ms = 30000
connections.max.idle.ms = 540000
max.task.idle.ms = 0
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
num.standby.replicas = 0
num.stream.threads = 1
partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
poll.ms = 100
processing.guarantee = at_least_once
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 60000
retries = 20
retry.backoff.ms = 60000
rocksdb.config.setter = null
send.buffer.bytes = 131072
state.cleanup.delay.ms = 600000
state.dir = /tmp/kafka-streams
topology.optimization = none
upgrade.from = null
windowstore.changelog.additional.retention.ms = 86400000
消费者配置:
auto.commit.interval.ms = 5000
auto.offset.reset = none
check.crcs = true
client.dns.lookup = default
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id =
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = false
isolation.level = read_uncommitted
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 2147483647
max.poll.records = 1000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 60000
retry.backoff.ms = 60000
非常感谢!