Question

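I have the following situation:

Table A and table B are linked by a foreign key.
A and B are inserted/updated in a single transaction.
Debezium emits an event a for table A and an event b for table B.
Kafka Streams creates a KStream for each of tables A and B.
The Kafka Streams application leftJoins KStreams A and B. (Assume that records a and b have the same key and fall within the join window.)
The output records are [a, null], [a, b].

How can I discard [a, null]?

One option is to use an innerJoin, but that still leaves the problem for update queries.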
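To make the problem concrete, here is a minimal plain-Java model of the join behaviour described above. It is not Kafka Streams code: the class `EagerJoinModel` and its `join` method are hypothetical, it handles a single key, it ignores window expiry, and it assumes the left join emits eagerly, as in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EagerJoinModel {
    /**
     * Joins events for one key in arrival order.
     * Each event is "A:payload" (left stream) or "B:payload" (right stream).
     */
    static List<String> join(List<String> events, boolean leftJoin) {
        List<String> seenA = new ArrayList<>();
        List<String> seenB = new ArrayList<>();
        List<String> out = new ArrayList<>();
        for (String event : events) {
            String payload = event.substring(2);
            if (event.startsWith("A:")) {
                seenA.add(payload);
                // A left join emits immediately when no right-side record exists yet.
                if (leftJoin && seenB.isEmpty()) {
                    out.add("[" + payload + ", null]");
                }
                for (String b : seenB) {
                    out.add("[" + payload + ", " + b + "]");
                }
            } else {
                seenB.add(payload);
                // A right-side arrival joins with every buffered left-side record.
                for (String a : seenA) {
                    out.add("[" + a + ", " + payload + "]");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Insert: a arrives before b, so the left join emits an intermediate result.
        System.out.println(join(Arrays.asList("A:a", "B:b"), true));
        // prints [[a, null], [a, b]]

        // Insert followed by update, inner join: intermediate pairs still appear.
        System.out.println(join(Arrays.asList("A:a1", "B:b1", "A:a2", "B:b2"), false));
        // prints [[a1, b1], [a2, b1], [a1, b2], [a2, b2]]
    }
}
```

In this model the inner join removes the [a, null] record, but an update to both tables still yields three pairs that are not the final [a2, b2].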

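We tried filtering by event timestamp (i.e. keeping the event with the latest timestamp), but timestamps are not guaranteed to be unique.

In other words, the end goal is to be able to identify the latest aggregation so that we can filter out intermediate results at query time (in Athena / Presto or some RDBMS).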

Answer 1

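So far, the best working approach I have found is to use the Kafka offset in the output records.

The approach can be summarized as:

Perform all the logic you want, without worrying about multiple records for the same key.
Write the results to an intermediate topic with a very short retention (e.g. 1 hour).
Read the intermediate topic with a Processor and, inside the processor, enrich each message with its Kafka offset using context.offset().
Write the messages to the output topic.

Your output topic now contains multiple messages for the same key, but each message has a different offset.

At query time, you can then use a subquery to select the maximum offset for each key.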
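The query-time selection can be sketched in plain Java as follows. This is only an illustration of "keep the row with the maximum offset per key"; the class `LatestByOffset`, the `Enriched` record shape, and the sample offsets are assumptions, not part of the original setup.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestByOffset {
    /** Hypothetical shape of an enriched output record. */
    static final class Enriched {
        final String key;
        final String value;
        final long timestamp;
        final long kafkaOffset;

        Enriched(String key, String value, long timestamp, long kafkaOffset) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
            this.kafkaOffset = kafkaOffset;
        }
    }

    /** Keeps, for each key, the record with the highest Kafka offset. */
    static Map<String, Enriched> latestPerKey(List<Enriched> records) {
        Map<String, Enriched> latest = new LinkedHashMap<>();
        for (Enriched r : records) {
            latest.merge(r.key, r,
                (old, cur) -> cur.kafkaOffset > old.kafkaOffset ? cur : old);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Both k1 records share a timestamp, so only the offset tells them apart.
        List<Enriched> rows = Arrays.asList(
            new Enriched("k1", "[a, null]", 1000L, 41L),
            new Enriched("k1", "[a, b]", 1000L, 42L),
            new Enriched("k2", "[c, d]", 1005L, 7L));
        latestPerKey(rows).forEach((k, r) -> System.out.println(k + " -> " + r.value));
        // prints:
        // k1 -> [a, b]
        // k2 -> [c, d]
    }
}
```

Kafka offsets are assigned per partition, so this comparison is only meaningful when all records for a given key are in the same partition of the intermediate topic.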

An example TransformerSupplier is shown below:

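import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.kstream.TransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;

/**
 * Stamps each record with the Kafka offset it was read from.
 *
 * @param <K> key type
 * @param <V> value type; must be able to store the offset
 */
public class OutputTransformSupplier<K, V extends OutputTransformSupplier.OffsetAware>
    implements TransformerSupplier<K, V, KeyValue<K, V>> {

  /** Assumed contract for values; the original only shows the setKafkaOffset call. */
  public interface OffsetAware {
    void setKafkaOffset(long offset);
  }

  @Override
  public Transformer<K, V, KeyValue<K, V>> get() {
    return new OutputTransformer<>();
  }

  private static class OutputTransformer<K, V extends OffsetAware>
      implements Transformer<K, V, KeyValue<K, V>> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
      this.context = context;
    }

    /**
     * @param key   the key for the record
     * @param value the value for the record
     * @return the same key and value, with the offset set on non-null values
     */
    @Override
    public KeyValue<K, V> transform(K key, V value) {
      if (value != null) {
        // Offset of the record currently being read from the intermediate topic.
        value.setKafkaOffset(context.offset());
      }
      return new KeyValue<>(key, value);
    }

    // Only part of the Transformer interface in Kafka Streams versions before 2.0;
    // left without @Override so the class compiles against either.
    public KeyValue<K, V> punctuate(long timestamp) {
      return null;
    }

    @Override
    public void close() {
      // nothing to close
    }
  }
}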
