How do I prevent the Kafka Connect JDBC connector from reading from the start/earliest messages, and instead only read the latest messages?

Asked: 2019-07-29 21:49:14

Tags: apache-kafka apache-kafka-connect

My Problem

I'm running Kafka Connect v2.12 in standalone mode, with the JDBC connector plugin v5.2.1.

I'm using JSON serialization, with the schema embedded in the payload.

Since the JDBC connector is fail-fast by design, it enters a FAILED state and stops processing messages whenever it receives a malformed message.

This would be fine if the JDBC connector only processed the latest messages: a bad message crashes it, it can be restarted, and it picks up at the next (hopefully well-formed) message.

However, my JDBC connector has started* reading all historical messages on startup. I noticed in the startup logs that auto.offset.reset was set to earliest. This was strange, since the default is latest and I had not set consumer.auto.offset.reset to earliest, latest, or anything else in my worker.properties file. In any case, I edited the worker.properties file to set consumer.auto.offset.reset to latest, as shown below.

The change took effect, in the sense that the startup logs now show auto.offset.reset=latest, but the connector still crashes on every startup because it tries to process a week-old message containing malformed JSON.

Which settings should I change so that my Kafka Connect worker only pulls Kafka messages sent after the worker starts up?

* Until last week, the connector read only the latest messages. I don't know whether I messed something up in my config or someone else changed a global setting on the Kafka brokers, but since last week it has been reading all messages from the earliest offset on every startup.


My Configuration

worker.properties

# This file was based from https://github.com/boundary/dropwizard-kafka/blob/master/kafka-connect/src/main/resources/kafka-connect/example.standalone.worker.properties

offset.storage.file.filename=/tmp/example.offsets

bootstrap.servers=kafka-0.kafka-headless.kafka:9092,kafka-1.kafka-headless.kafka:9092,kafka-2.kafka-headless.kafka:9092
offset.flush.interval.ms=10000

rest.port=8083
rest.advertised.port=8083

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true

# Prevent the connector from pulling all historical messages
consumer.auto.offset.reset=latest

My Logs

For reference, here is the list of historical messages. Guess which one triggered the error, lol.

{"schema":{"type":"struct","fields":[{"type":"int64","optional":false,"field":"id"},{"type":"string","optional":false,"field":"status"}],"optional":false,"name":"example_topic"},"payload":{"id":1337,"status":"success"}}
{"schema":{"type":"struct","fields":[{"type":"int64","optional":false,"field":"id"},{"type":"string","optional":false,"field":"status"}],"optional":false,"name":"example_topic"},"payload":{"id":1337,"status":"success"}}
{"schema":{"type":"struct","fields":[{"type":"int64","optional":false,"field":"id"},{"type":"string","optional":false,"field":"status"}],"optional":false,"name":"example_topic"},"payload":{"id":1337,"status":"success"}}
{"schema":{"type":"struct","fields":[{"type":"int64","optional":false,"field":"id"},{"type":"string","optional":false,"field":"status"}],"optional":false,"name":"example_topic"},"payload":{"id":1337,"status":"success"}}
kafka_connect/bin/kafka-console-producer.sh \
      --broker-list kafka-0.kafka-headless.kafka:9092,kafka-1.kafka-headless.kafka:9092,kafka-2.kafka-headless.kafka:9092 \
      --topic example_topic
{"schema":{"type":"struct","fields":[{"type":"int64","optional":false,"field":"id"},{"type":"string","optional":false,"field":"status"}],"optional":false,"name":"example_topic"},"payload":{"id":1337,"status":"success"}}

And here are the logs and error messages that appear on startup:

[2019-07-30 14:20:55,020] INFO Initializing writer using SQL dialect: PostgreSqlDatabaseDialect (io.confluent.connect.jdbc.sink.JdbcSinkTask:57)
[2019-07-30 14:20:55,021] INFO WorkerSinkTask{id=postgres_sink-0} Sink task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:302)
[2019-07-30 14:20:55,031] INFO Cluster ID: DPpxrPbVR5qwiI9vz_Gkkw (org.apache.kafka.clients.Metadata:273)
[2019-07-30 14:20:55,031] INFO [Consumer clientId=consumer-1, groupId=connect-postgres_sink] Discovered group coordinator kafka-2.kafka-headless.kafka:9092 (id: 2147483645 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:677)
[2019-07-30 14:20:55,033] INFO [Consumer clientId=consumer-1, groupId=connect-postgres_sink] Revoking previously assigned partitions [] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:462)
[2019-07-30 14:20:55,033] INFO [Consumer clientId=consumer-1, groupId=connect-postgres_sink] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:509)
[2019-07-30 14:20:58,046] INFO [Consumer clientId=consumer-1, groupId=connect-postgres_sink] Successfully joined group with generation 884 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:473)
[2019-07-30 14:20:58,048] INFO [Consumer clientId=consumer-1, groupId=connect-postgres_sink] Setting newly assigned partitions [example_topic-0] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:280)
[2019-07-30 14:20:58,097] ERROR WorkerSinkTask{id=postgres_sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:177)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:513)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error: 
    at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:334)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:513)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
    ... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'kafka_connect': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"kafka_connect/bin/kafka-console-producer.sh \"; line: 1, column: 15]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'kafka_connect': was expecting ('true', 'false' or 'null')
 at [Source: (byte[])"kafka_connect/bin/kafka-console-producer.sh \"; line: 1, column: 15]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:703)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3532)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2627)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:832)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:729)
    at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4042)
    at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2571)
    at org.apache.kafka.connect.json.JsonDeserializer.deserialize(JsonDeserializer.java:50)
    at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:332)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:513)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:513)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

2 Answers:

Answer 0 (score: 1):

I believe the issue you're facing is that the consumer.auto.offset.reset setting only applies to consumer groups that have no stored offsets.

For example, suppose you have a consumer group starting up for the first time. It looks for stored offsets, doesn't find any, and so consults the consumer.auto.offset.reset setting. Say it is set to earliest: the consumer starts at the beginning of the log, processes some messages, and commits its offsets (standard consumer behavior). Next, you decide you don't want that, so you set consumer.auto.offset.reset=latest and restart. The consumer group looks up its offsets again, and this time it finds them, because it committed offsets earlier, so it never consults the reset setting (you did set it to latest, but the setting is ignored because committed offsets exist).

Your original consumer was probably using earliest for some reason, and now that offsets have been committed for the consumer group, you can no longer start at latest.
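
If you want to confirm this, you can inspect the group's committed offsets with the kafka-consumer-groups.sh script. A minimal sketch, assuming the group id connect-postgres_sink that appears in your startup logs:

# Describe the sink's consumer group to see its committed offsets per partition
kafka_connect/bin/kafka-consumer-groups.sh \
      --bootstrap-server kafka-0.kafka-headless.kafka:9092 \
      --group connect-postgres_sink \
      --describe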

To fix this, you can either change the consumer group's name (I'm not sure whether Kafka Connect exposes that) or use the kafka-consumer-groups.sh script that ships with Kafka to reset the group's offsets to the latest position, as sketched below.
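
A sketch of the offset reset, again assuming the group id connect-postgres_sink from your logs. Note that the connector must be stopped first, since Kafka only allows resetting offsets for a group with no active members:

# Move the group's committed offsets to the end of the topic
kafka_connect/bin/kafka-consumer-groups.sh \
      --bootstrap-server kafka-0.kafka-headless.kafka:9092 \
      --group connect-postgres_sink \
      --topic example_topic \
      --reset-offsets --to-latest \
      --execute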

Hope this helps.

Answer 1 (score: 0):

Kafka Connect processes a message in three stages (for a sink connector):

  1. Convert the message (using a Converter)
  2. Transform the message (using Transformations)
  3. Write the message to the external system

If a message is invalid (the Converter cannot convert it properly), you can set properties to skip it:

"errors.tolerance": "all",
"errors.log.enable":true,
"errors.log.include.messages":true 

From the Kafka Connect configuration docs:

errors.tolerance:

  Behavior for tolerating errors during connector operation. 'none' is the default value and signals that any error will result in an immediate connector task failure; 'all' changes the behavior to skip over problematic records.

errors.log.enable:

  If true, write each error and the details of the failed operation and problematic record to the Connect application log. This is 'false' by default, so that only errors that are not tolerated are reported.

errors.log.include.messages:

  Whether to include in the log the Connect record that resulted in a failure. This is 'false' by default, which prevents record keys, values, and headers from being written to log files, although some information such as topic and partition number will still be logged.
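
Since you are running in standalone mode with properties files rather than submitting JSON over the REST API, the same settings can go in your connector's properties file. A sketch, assuming a hypothetical postgres_sink.properties for your sink connector:

# Hypothetical postgres_sink.properties fragment:
# skip records that fail conversion instead of failing the task,
# and log the details of each skipped record
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true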