Kafka Streams - joining two KTables invokes the join function twice

Date: 2017-01-02 16:00:07

Tags: apache-kafka apache-kafka-streams

I am trying to join 2 KTables.

KTable<String, RecordBean> recordsTable = builder.table(Serdes.String(),
    new JsonPOJOSerde<>(RecordBean.class),
    bidTopic, RECORDS_STORE);

KTable<String, ImpressionBean> impressionsTable = builder.table(Serdes.String(),
    new JsonPOJOSerde<>(ImpressionBean.class),
    impressionTopic, IMPRESSIONS_STORE);

KTable<String, RecordBean> mergedByTxId = recordsTable
    .join(impressionsTable, merge());

The merge function is very simple; I just copy a value from one bean to the other.

public static <K extends BidInfo, V extends BidInfo> ValueJoiner<K, V, K> merge() {
  return (v1, v2) -> {
    v1.setRtbWinningBidAmount(v2.getRtbWinningBidAmount());
    return v1;
  };
}
But for some reason the join function is invoked twice for a single produced record. See the streams/producer configuration below.

Properties streamsConfiguration = new Properties();
streamsConfiguration
    .put(StreamsConfig.APPLICATION_ID_CONFIG, "join-impressions");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, CLUSTER.bootstrapServers());

streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, CLUSTER.zookeeperConnect());
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, folder.newFolder("kafka-streams-tmp")
    .getAbsolutePath());

return streamsConfiguration;

Producer configuration -

Properties producerConfig = new Properties();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, CLUSTER.bootstrapServers());
producerConfig.put(ProducerConfig.ACKS_CONFIG, "all");
producerConfig.put(ProducerConfig.RETRIES_CONFIG, 0);
producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

return producerConfig;

Next I submit a single record to each stream. Both records have the same key. I expect to receive a single record as output.

 IntegrationTestUtils.produceKeyValuesSynchronously(bidsTopic,
    Arrays.asList(new KeyValue("1", getRecordBean("1"))),
    getProducerProperties());

IntegrationTestUtils.produceKeyValuesSynchronously(impressionTopic,
    Arrays.asList(new KeyValue("1", getImpressionBean("1"))),
    getProducerProperties());

List<KeyValue<String, String>> parsedRecord =
    IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(getConsumerProperties(),
        outputTopic, 1);

But the ValueJoiner fires twice, and I get 2 identical output records instead of 1. At the moment it fires, the values from both streams are present, so I cannot work out what triggers the second execution.

Without the join I cannot reproduce this behavior. I could not find any working example of a two-KTable join, so I cannot understand what is wrong with my approach.

Here is simpler code that demonstrates the same behavior:

KStreamBuilder builder = new KStreamBuilder();

KTable<String, String> first = builder.table("stream1", "storage1");
KTable<String, String> second = builder.table("stream2", "storage2");

KTable<String, String> joined = first.join(second, (value1, value2) -> value1);

joined.to("output");

KafkaStreams streams = new KafkaStreams(builder, getStreamingProperties());

streams.start();

IntegrationTestUtils.produceKeyValuesSynchronously("stream1",
    Arrays.asList(new KeyValue("1", "first stream")),
    getProducerProperties());

IntegrationTestUtils.produceKeyValuesSynchronously("stream2",
    Arrays.asList(new KeyValue("1", "second stream")),
    getProducerProperties());

List<KeyValue<String, String>> parsedRecord =
    IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(getConsumerProperties(),
        "output", 1);

2 answers:

Answer 0 (score: 2)

After posting a similar question to the Confluent mailing list, I got the following explanation:

I think this may be related to caching. The caches for the 2 tables are flushed independently, so there is a chance you will get the same record twice. If stream1 and stream2 both receive a record for the same key, and the caches flush, then:

The cache from stream1 will flush, perform the join, and produce a record.

The cache from stream2 will flush, perform the join, and produce a record.

Technically this is fine, since the result of the join is another KTable, so the value in the KTable will be the correct value.
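The flush sequence described above can be sketched as a toy simulation. This is not Kafka code: the two maps stand in for the two tables' state stores, and each `flush` method models one cache flushing and re-evaluating the join for the key. Two independent flushes yield two identical join results:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of independently flushed caches (NOT Kafka API code).
public class CacheFlushSketch {
    static final Map<String, String> store1 = new HashMap<>();
    static final Map<String, String> store2 = new HashMap<>();
    static final List<String> output = new ArrayList<>();

    // stream1's cache flushes: look up the other side and emit if present.
    static void flush1(String key) {
        String other = store2.get(key);
        if (other != null) output.add(store1.get(key) + "+" + other);
    }

    // stream2's cache flushes: same join, evaluated a second time.
    static void flush2(String key) {
        String other = store1.get(key);
        if (other != null) output.add(other + "+" + store2.get(key));
    }

    static List<String> run() {
        output.clear();
        store1.put("1", "first stream");
        store2.put("1", "second stream");
        flush1("1");
        flush2("1");
        return output;
    }

    public static void main(String[] args) {
        // Prints two identical join results for a single key.
        System.out.println(run());
    }
}
```

Both flushes see both values present, so each emits the same joined record, matching the duplicate output observed in the question.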

After setting StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG to 0, the problem was resolved. I still get 2 records, but now one record is joined with null, and per the join-semantics documentation its behavior is perfectly clear.
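A sketch of that fix, as a fragment to add to the streams configuration shown earlier (setting the cache buffer to 0 disables record caching, so updates flow through the join without the per-store flush that produced the duplicate output):

```java
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-impressions");
// Disable the record cache entirely; each update is forwarded immediately
// instead of being buffered and flushed independently per store.
streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
```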

Answer 1 (score: 0)

I found the same behavior using leftJoin between two KTables and stumbled across this post after googling. I don't know which version of kafka-streams you are using, but after debugging the Confluent code, kafka-streams version 2.0.1 appears to deliberately send both the old value and the new value in certain types of joins, so you receive two invocations of the ValueJoiner.

Take a look at the implementation of org.apache.kafka.streams.kstream.internals.KTableImpl#buildJoin, which constructs the join topology, as well as the processor implementation that dispatches the topology at runtime. In some cases this is clearly done twice.

Here is some background on this behavior: https://issues.apache.org/jira/browse/KAFKA-2984