Question

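I'm trying to find a way to reorder the messages within a topic partition and send the ordered messages to a new topic.

I have a Kafka publisher that sends String messages in the following format: {system_timestamp}-{event_name}?{parameters}

For example: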

1494002667893-client.message?chatName=1c&messageBody=hello
1494002656558-chat.started?chatName=1c&chatPatricipants=3
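
For reference, here is a minimal sketch of how the {system_timestamp} part could be extracted from such a message in plain Java. The class name `EventMessage` and its fields are hypothetical, and the sketch assumes that the first `-` always terminates the timestamp and that parameters are not URL-encoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the {system_timestamp}-{event_name}?{parameters} format.
// It is not part of any Kafka API.
final class EventMessage {
    final long timestamp;
    final String eventName;
    final Map<String, String> parameters;

    private EventMessage(long timestamp, String eventName, Map<String, String> parameters) {
        this.timestamp = timestamp;
        this.eventName = eventName;
        this.parameters = parameters;
    }

    static EventMessage parse(String raw) {
        // Assumption: the first '-' separates the timestamp from the event name.
        int dash = raw.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("Missing timestamp: " + raw);
        }
        long timestamp = Long.parseLong(raw.substring(0, dash));

        // The '?' is optional: a message may carry no parameters.
        int question = raw.indexOf('?', dash + 1);
        String eventName = question < 0
                ? raw.substring(dash + 1)
                : raw.substring(dash + 1, question);

        Map<String, String> parameters = new LinkedHashMap<>();
        if (question >= 0 && question + 1 < raw.length()) {
            for (String pair : raw.substring(question + 1).split("&")) {
                if (pair.isEmpty()) {
                    continue; // tolerate "a=1&&b=2"
                }
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    parameters.put(pair, "");
                } else {
                    parameters.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return new EventMessage(timestamp, eventName, parameters);
    }
}
```

Only `timestamp` is needed for reordering; the other fields are parsed just to show the whole format.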

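In addition, we attach a message key to each message so that it is sent to the corresponding partition.

What I want to do is reorder events based on the {system_timestamp} part of the message, within a 1-minute window, because our publishers do not guarantee that messages are sent in {system_timestamp} order.

For example, a message with a larger {system_timestamp} value can be delivered to the topic first.

I have looked into the Kafka Streams API and found some examples of message windowing and aggregation: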

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-sorter");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("events");
KGroupedStream<String, String> groupedStream = stream.groupByKey(); // groups events within a partition

/* Commented out: I don't think I need any aggregation, but I guess I can't use time windowing without one.
KTable<Windowed<String>, String> windowedEvents = stream.groupByKey().aggregate(
        () -> "",  // initial value
        (aggKey, value, aggregate) -> aggregate + "",   // aggregating value
        TimeWindows.of(1000), // window size in milliseconds
        Serdes.String(), // serde for aggregated value
        "test-store"
);*/

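But what should I do next with this grouped stream? I don't see any 'sort() (e1, e2) -> e1.compareTo(e2)' method available. Windows can be applied to methods such as aggregate(), reduce(), and count(), but I don't think I need any manipulation of the message data.

How can I reorder messages within a 1-minute window and send them to another topic?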

Answer 1

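Here's an outline:

Create a Processor implementation that:

- in process(), for each message:
  - reads the timestamp from the message value
  - inserts into a KeyValueStore using the (timestamp, message-key) pair as the key and the message value as the value. Note that this also provides de-duplication. You'll need to provide a custom Serde to serialize the key so that the timestamp comes first, byte-wise, so that range queries are ordered by timestamp.
- in punctuate():
  - reads the store using a range fetch from 0 to timestamp - 60'000 (= 1 minute)
  - sends the fetched messages in order using context.forward() and deletes them from the store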
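
The outline above can be sketched in plain Java without any Kafka classes, to show why the key layout matters. This is only an illustration under stated assumptions: `ReorderBuffer`, `storeKey`, and `WINDOW_MS` are hypothetical names, a `TreeMap` sorted by unsigned byte comparison stands in for the KeyValueStore, timestamps are assumed to be non-negative, and `punctuate` returns the messages instead of calling context.forward(). It needs Java 9 or later for `Arrays.compareUnsigned`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the processor plus its store; no Kafka classes are used.
final class ReorderBuffer {
    static final long WINDOW_MS = 60_000L; // 1 minute

    // Keys are compared as unsigned bytes, the way a byte-ordered store would sort them.
    private final NavigableMap<byte[], String> store =
            new TreeMap<byte[], String>((a, b) -> Arrays.compareUnsigned(a, b));

    // Serializes (timestamp, messageKey) with the timestamp first. ByteBuffer writes
    // longs big-endian by default, so for non-negative timestamps the byte order
    // of the keys matches the numeric order of the timestamps.
    static byte[] storeKey(long timestamp, String messageKey) {
        byte[] keyBytes = messageKey.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + keyBytes.length)
                .putLong(timestamp)
                .put(keyBytes)
                .array();
    }

    // Corresponds to process(): read the timestamp from the value and buffer the message.
    // A repeated (timestamp, messageKey) pair overwrites the earlier entry (de-duplication).
    void process(String messageKey, String value) {
        long timestamp = Long.parseLong(value.substring(0, value.indexOf('-')));
        store.put(storeKey(timestamp, messageKey), value);
    }

    // Corresponds to punctuate(): return, in timestamp order, every buffered message
    // strictly older than (streamTime - WINDOW_MS), and remove it from the store.
    List<String> punctuate(long streamTime) {
        long cutoff = streamTime - WINDOW_MS;
        if (cutoff <= 0) {
            return Collections.emptyList(); // nothing can be old enough yet
        }
        NavigableMap<byte[], String> ready = store.headMap(storeKey(cutoff, ""), false);
        List<String> out = new ArrayList<>(ready.values());
        ready.clear(); // the view is backed by the store, so this deletes the entries
        return out;
    }
}
```

In a real processor the same key layout would live in the custom Serde, and the store would be a persistent KeyValueStore rather than an in-memory map.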

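The problem with this approach is that punctuate() is not triggered if no new messages arrive to advance "stream time". If that is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each (!) partition of your topic. Your processor should ignore them, but they will cause punctuate() to fire in the absence of "real" messages. KIP-138 will address this limitation by adding explicit support for system-time punctuation: https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics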

Answer 2

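This is how I order the stream in my project.

Create a topology with a source, a processor, and a sink.
In the processor:
1. process(key, value) -> add each record to a list (an instance variable).
2. init() -> schedule(WINDOW_BUFFER_TIME, WALL_CLOCK_TIME) -> punctuate(timestamp): sort the items buffered in the list (instance variable) during the window buffer time, then iterate over them and forward each one. Clear the list (instance variable).

This logic works fine for me.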
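
The buffering logic in these two steps might look roughly like the following plain-Java sketch. `SortingBuffer` and `timestampOf` are hypothetical names, a `Consumer<String>` stands in for forwarding to the sink, and sorting by the timestamp embedded in the message value is an assumption taken from the question's message format. In a real processor, `punctuate()` would be invoked by the callback registered through the scheduling call in init(), not by hand.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the processor described above; no Kafka classes are used.
final class SortingBuffer {
    private final List<String> buffer = new ArrayList<>(); // the "instance variable" list
    private final Consumer<String> downstream;             // stands in for forwarding to the sink

    SortingBuffer(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    // Step 1: buffer every incoming record.
    void process(String key, String value) {
        buffer.add(value);
    }

    // Step 2: on each timer tick, sort what was buffered, forward it in order, then clear.
    void punctuate() {
        buffer.sort(Comparator.comparingLong(SortingBuffer::timestampOf));
        buffer.forEach(downstream);
        buffer.clear();
    }

    // Assumption: values follow the {system_timestamp}-{event_name}?{parameters} format.
    static long timestampOf(String value) {
        return Long.parseLong(value.substring(0, value.indexOf('-')));
    }
}
```

Because the list lives only in memory, records buffered at the time of a restart are lost, and records are ordered only within a single flush, not across flushes.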

Apache Kafka: order windowed messages based on their value
