Question

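Consider this Kafka consumer, which receives data from a topic, buffers it in a PreparedStatement, and issues an INSERT query to the database once 100K records have been batched.

This works well as long as data keeps coming in. But when, for example, 20K records are buffered and no more records arrive, it keeps waiting for another 80K records before it flushes the statement. I would like to flush those 20K records if the stream stalls for a while. How can I do that? I don't see any way to hook into it.

For example, in PHP with the php-rdkafka extension, which is based on librdkafka, I get RD_KAFKA_RESP_ERR__PARTITION_EOF when the end of a partition is reached, so it is easy to hook the buffer flush in when that happens.

I tried to simplify the code so that only the important parts remain: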

public class TestConsumer {

    private final Connection connection;
    private final CountDownLatch shutdownLatch;
    private final KafkaConsumer<String, Message> consumer;
    private int processedCount = 0;

    public TestConsumer(Connection connection) {
        this.connection = connection;
        this.consumer = new KafkaConsumer<>(getConfig(), new StringDeserializer(), new ProtoDeserializer<>(Message.parser()));
        this.shutdownLatch = new CountDownLatch(1);
    }

    public void execute() {
        PreparedStatement statement;
        try {
            statement = getPreparedStatement();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }

        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            commit(statement);

            consumer.wakeup();
        }));

        consumer.subscribe(Collections.singletonList("source.topic"));

        try {
            while (true) {
                ConsumerRecords<String, Message> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));

                records.forEach(record -> {
                    Message message = record.value();
                    try {
                        fillBatch(statement, message);
                        statement.addBatch();
                    } catch (SQLException e) {
                        throw new RuntimeException(e);
                    }
                });

                processedCount += records.count();

                if (processedCount > 100000) {
                    commit(statement);
                }
            }
        } catch (WakeupException e) {
            // ignore, we're closing
        } finally {
            consumer.close();
            shutdownLatch.countDown();
        }
    }

    private void commit(PreparedStatement statement) {
        try {
            statement.executeBatch();
            consumer.commitSync();
            processedCount = 0;
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }


    protected void fillBatch(PreparedStatement statement, Message message) throws SQLException {
        // getTime() is multiplied by 1000, so it is presumably in seconds; Timestamp expects milliseconds.
        statement.setTimestamp(1, new Timestamp(message.getTime() * 1000L));
    }
}

Answer 1

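I understand your problem as follows:

You want to consume messages from Kafka
Buffer them up to 100K records
Commit them to the database in one batch
But you only want to wait t seconds (let's say 10 seconds)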

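This can be achieved in an efficient and reliable way with Kafka's built-in consumer batching, provided you can somehow predict the average size of your messages in bytes.

In the Kafka consumer configuration, you need to set the following:

fetch.min.bytes => this should be 100K x the average message size

fetch.max.wait.ms => this is your timeout in milliseconds (e.g. 5000 to wait 5 seconds)

max.partition.fetch.bytes => the maximum amount of data per partition. This helps to fine-tune the total fetch size

max.poll.records => the maximum number of records returned by a single poll. This can be set to 100K

fetch.max.bytes => if you want to set an upper limit for a single request

This way, a poll returns up to 100K records if they fit within the configured byte sizes, but it waits no longer than the configured number of milliseconds.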
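As a rough illustration, the settings above could be assembled as shown below. This is only a sketch: the class `BatchingConsumerConfig`, its `build` method, and the parameter names are hypothetical, the average message size is something you have to measure for your own topic, and setting both byte limits equal to the batch size is an assumption (it allows a single partition to deliver a whole batch). Only the property keys are actual Kafka consumer settings.

```java
import java.util.Properties;

final class BatchingConsumerConfig {

    /**
     * Builds the batching-related consumer settings (hypothetical helper).
     * avgMessageBytes is an estimate that has to be measured per topic.
     */
    static Properties build(int batchRecords, int avgMessageBytes, int maxWaitMs) {
        // These Kafka settings are ints, so fail loudly instead of overflowing.
        int batchBytes = Math.multiplyExact(batchRecords, avgMessageBytes);

        Properties props = new Properties();
        // Broker holds the fetch until this many bytes are available ...
        props.setProperty("fetch.min.bytes", Integer.toString(batchBytes));
        // ... or until this much time has passed, whichever comes first.
        props.setProperty("fetch.max.wait.ms", Integer.toString(maxWaitMs));
        // Upper bound on the number of records returned by one poll().
        props.setProperty("max.poll.records", Integer.toString(batchRecords));
        // Assumption: let one partition (and one request) carry a full batch.
        props.setProperty("max.partition.fetch.bytes", Integer.toString(batchBytes));
        props.setProperty("fetch.max.bytes", Integer.toString(batchBytes));
        return props;
    }
}
```

For instance, `BatchingConsumerConfig.build(100_000, 200, 5_000)` asks for roughly 20 MB per fetch and waits at most 5 seconds. The result would still need the usual connection settings, such as `bootstrap.servers` and `group.id`, before being passed to a `KafkaConsumer`.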

Once the poll returns the records, you can save them in one go and repeat.
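The resulting loop might look roughly like the sketch below. To keep it self-contained, Kafka and JDBC are replaced by plain functional interfaces: `poll` stands in for `consumer.poll(...)` called with a finite timeout (an assumption, the question's code polls with `Long.MAX_VALUE`), and `saveBatch` stands in for filling the `PreparedStatement`, calling `executeBatch()`, and then `commitSync()`. The class `PollAndSaveLoop` and the `iterations` bound are hypothetical and exist only so that the example terminates.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class PollAndSaveLoop {

    /**
     * Runs a fixed number of poll cycles and saves every non-empty poll
     * result as one batch. Returns the number of batches saved.
     */
    static <T> int run(Supplier<List<T>> poll, Consumer<List<T>> saveBatch, int iterations) {
        int batches = 0;
        for (int i = 0; i < iterations; i++) {
            // Stand-in for consumer.poll(timeout): may return an empty list.
            List<T> records = poll.get();
            if (records.isEmpty()) {
                continue; // nothing arrived within the timeout, nothing to flush
            }
            // Stand-in for addBatch() per record, executeBatch(), commitSync().
            saveBatch.accept(records);
            batches++;
        }
        return batches;
    }
}
```

With this shape there is no separate counter to maintain: whatever a poll returns, whether 100K records or the 20K that were available when the wait expired, is written as one batch.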

When the topic
