Question

I have records that are processed with Kafka Streams (using the Processor API). Let's say each record has a city_id and some other fields.

In the Kafka Streams app I want to add the current temperature in the target city to the record.
The Temperature<->City pairs are stored in, e.g., Postgres.

In a Java application I can connect to Postgres using JDBC and build a new HashMap<CityId, Temperature>, so I can look up the temperature by city_id. Something like tempHM.get(record.city_id).

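There are several questions about how best to approach this:

Where to initialize the context data?

Originally, I was doing it in AbstractProcessor::init(), but that seems wrong because it is initialized for each thread and reinitialized on every rebalance.

So I moved it out, to before the stream topology builder and the processors are built with it. The data is then fetched only once, independently of all processor instances.

Is that a proper and valid approach? It works, but...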

HashMap<CityId, Temperature> tempHM = new HashMap<>();

// Connect to DB and initialize tempHM here

Topology topology = new Topology();

topology
    .addSource(SOURCE, stringDeserializer, protoDeserializer, "topic-in")
    // The supplier runs once per processor instance; every instance shares the same tempHM
    .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(tempHM), SOURCE)
    .addSink(SINK, "topic-out", stringSerializer, protoSerializer, TemperatureAppender.NAME);

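How to refresh the context data?

I would like to refresh the temperature data every 15 minutes, for example. I was thinking of using a HashMap container instead of a plain HashMap, which would handle the refresh: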

abstract class ContextContainer<T> {

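    // volatile: the container is shared by processors that may run on different stream threads
    volatile T context;
    volatile Date lastRefreshAt;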

    ContextContainer(Date now) {
        refresh(now);
    }

    abstract void refresh(Date now);

    abstract Duration getRefreshInterval();

    T get() {
        return context;
    }

    boolean isDueToRefresh(Date now) {
        return lastRefreshAt == null
            || lastRefreshAt.getTime() + getRefreshInterval().toMillis() < now.getTime();
    }
}

final class CityTemperatureContextContainer extends ContextContainer<Map<CityId, Temperature>> {

    CityTemperatureContextContainer(Date now) {
        super(now);
    }

    @Override
    void refresh(Date now) {
        if (!isDueToRefresh(now)) {
            return;
        }

        // Fill a fresh map completely before publishing it, so readers never see a partial map
        Map<CityId, Temperature> context = new HashMap<>();
        // Connect to DB and get data and fill hashmap

        lastRefreshAt = now;
        this.context = context;
    }

    @Override
    Duration getRefreshInterval() {
        return Duration.ofMinutes(15);
    }
}

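This is a short concept written in the SO textarea, so it might contain some syntax errors, but I hope the point is clear.

Then pass it into the processor, like .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(cityTemperatureContextContainer), SOURCE)

And in the processor do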

    public void init(final ProcessorContext context) {
        context.schedule(
            Duration.ofMinutes(1),
            // STREAM_TIME only advances as records arrive; WALL_CLOCK_TIME fires on system time
            PunctuationType.STREAM_TIME,
            (timestamp) -> {
                cityTemperatureContextContainer.refresh(new Date(timestamp));
                tempHm = cityTemperatureContextContainer.get();
            }
        );

        super.init(context);
    }
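
Because one container instance is handed to every processor, several stream threads may call refresh() and get() at the same time. A minimal sketch of a thread-safe variant follows, using only the JDK. The class name RefreshingSnapshot, the loader supplier (standing in for the JDBC query), and the refreshIfDue method are hypothetical names chosen for illustration, not part of Kafka Streams.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Holds an immutable snapshot and reloads it once the refresh interval has elapsed. */
final class RefreshingSnapshot<K, V> {

    /** Snapshot plus its load time, published together as one unit. */
    private static final class Loaded<K, V> {
        final Map<K, V> data;
        final Instant at;

        Loaded(Map<K, V> data, Instant at) {
            this.data = data;
            this.at = at;
        }
    }

    private final Supplier<Map<K, V>> loader; // assumption: e.g. wraps a JDBC query
    private final Duration interval;
    private final AtomicReference<Loaded<K, V>> current;

    RefreshingSnapshot(Supplier<Map<K, V>> loader, Duration interval, Instant now) {
        this.loader = loader;
        this.interval = interval;
        // Map.copyOf makes the snapshot immutable (it rejects null keys and values).
        this.current = new AtomicReference<>(new Loaded<>(Map.copyOf(loader.get()), now));
    }

    /** Reloads if the interval has elapsed; returns true if this call published a new snapshot. */
    boolean refreshIfDue(Instant now) {
        Loaded<K, V> seen = current.get();
        if (now.isBefore(seen.at.plus(interval))) {
            return false;
        }
        Loaded<K, V> fresh = new Loaded<>(Map.copyOf(loader.get()), now);
        // Two threads may both load, but only the first swap wins; the loser's result is dropped.
        return current.compareAndSet(seen, fresh);
    }

    /** Returns the latest published snapshot; safe to call from any thread. */
    Map<K, V> get() {
        return current.get().data;
    }
}
```

In the punctuator, the call would then be something like snapshot.refreshIfDue(Instant.ofEpochMilli(timestamp)), with lookups going through snapshot.get() on every record instead of caching the map in a processor field.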

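Is there a better way? The main goal is to find the right concept; I can implement it from there. There are not many resources on this topic, though.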

Answer 1

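In the Kafka Streams app I want to add the current temperature in the target city to the record. The Temperature<->City pairs are stored in, e.g., Postgres.

In a Java application I can connect to Postgres using JDBC and build a new HashMap<CityId, Temperature>, so I can look up the temperature by city_id. Something like tempHM.get(record.city_id).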

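A better alternative would be to use Kafka Connect to ingest your data from Postgres into a Kafka topic, read this topic into a KTable in your application with Kafka Streams, and then join this KTable with your other stream (the stream of records "with city_id and some other fields"). That is, you would be doing a KStream-to-KTable join.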
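
With the Kafka Streams DSL, such a KStream-to-KTable join could look roughly like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the topic name city-temperatures, the String serdes, the class name TemperatureJoinTopology, and the assumption that input records are already keyed by city_id are all hypothetical.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

final class TemperatureJoinTopology {

    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Latest temperature per city, keyed by city_id (hypothetical topic fed by Kafka Connect).
        KTable<String, String> temperatures = builder.table(
            "city-temperatures", Consumed.with(Serdes.String(), Serdes.String()));

        // Input records; assumed here to be keyed by city_id already.
        KStream<String, String> records = builder.stream(
            "topic-in", Consumed.with(Serdes.String(), Serdes.String()));

        records
            // Inner join: a record whose city has no table entry yet produces no output.
            .join(temperatures, (value, temperature) -> value + ",temperature=" + temperature)
            .to("topic-out", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```

If the input records are not keyed by city_id, they would need to be re-keyed first (for example with selectKey), and both topics need the same number of partitions for the join to line up. Using leftJoin instead of join keeps records without a matching city and passes null as the temperature.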

Think:

### Architecture view

DB (here: Postgres) --Kafka Connect--> Kafka --> Kafka Streams Application


### Data view

Postgres Table ----------------------> Topic --> KTable

Example connectors for your use case are https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc and https://www.confluent.io/hub/debezium/debezium-connector-postgresql.
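
As a rough illustration, a JDBC source connector configuration for this use case might look like the fragment below. The connector name, connection details, table name city_temperature, and column names city_id and updated_at are assumptions made up for the example. The two transforms copy city_id into the record key, so that the resulting topic (here pg-city_temperature) can be read as a KTable keyed by city.

```properties
name=city-temperature-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/weather
connection.user=streams
connection.password=change-me
table.whitelist=city_temperature
# Pick up rows whose timestamp column changed since the last poll
mode=timestamp
timestamp.column.name=updated_at
topic.prefix=pg-
poll.interval.ms=60000
# Key each record by city_id
transforms=createKey,extractKey
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=city_id
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=city_id
```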

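One of the advantages of the Kafka Connect based setup above is that you no longer need to talk directly from your Java application (which uses Kafka Streams) to your Postgres DB.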

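Another advantage is that you don't need to do "batch refreshes" of your context data (you mentioned every 15 minutes) from your DB into your Java application, because the application would get the latest DB changes in real time, automatically, via the DB -> KConnect -> Kafka -> KStreams-app flow.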
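
To make the "no batch refresh" point concrete, here is a plain-Java model of the lookup semantics, with no Kafka API involved. The class and method names are hypothetical. Each change event overwrites the table entry for its city, and each stream record is enriched with whatever value is current when it arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a stream-table join: a table of latest values plus a stream of lookups. */
final class TableLookupModel {
    private final Map<String, Double> table = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    /** A change event from the table topic: upsert the latest temperature for a city. */
    void onTableUpdate(String cityId, double temperature) {
        table.put(cityId, temperature);
    }

    /** A stream record: enriched with the temperature that is current on arrival. */
    void onRecord(String cityId, String payload) {
        Double temperature = table.get(cityId);
        if (temperature != null) { // inner-join behaviour: no match, no output
            output.add(payload + "@" + temperature);
        }
    }

    List<String> output() {
        return output;
    }
}
```

A record that arrives after a table update sees the new temperature immediately, without any scheduled reload in between.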
