Question

As the last step in a streaming application, I want to sort the out-of-order events in the system. To do that, I used:

events.keyBy((Event event) -> event.id)
                .process(new SortFunction())
                .print();

where SortFunction is:

public static class SortFunction extends KeyedProcessFunction<String, Event, Event> {
        private ValueState<PriorityQueue<Event>> queueState = null;

        @Override
        public void open(Configuration config) {
            ValueStateDescriptor<PriorityQueue<Event>> descriptor = new ValueStateDescriptor<>(
                    // state name
                    "sorted-events",
                    // type information of state
                    TypeInformation.of(new TypeHint<PriorityQueue<Event>>() {
                    }));
            queueState = getRuntimeContext().getState(descriptor);
        }

        @Override
        public void processElement(Event event, Context context, Collector<Event> out) throws Exception {
            TimerService timerService = context.timerService();

            if (context.timestamp() > timerService.currentWatermark()) {
                PriorityQueue<Event> queue = queueState.value();
                if (queue == null) {
                    queue = new PriorityQueue<>(10);
                }
                queue.add(event);
                queueState.update(queue);
                timerService.registerEventTimeTimer(event.timestamp);
            }
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext context, Collector<Event> out) throws Exception {
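            PriorityQueue<Event> queue = queueState.value();
            long watermark = context.timerService().currentWatermark();
            // Emit, in order, every buffered event that the watermark has passed.
            Event head = queue.peek();
            while (head != null && head.timestamp <= watermark) {
                out.collect(head);
                queue.remove(head);
                head = queue.peek();
            }
            // Write the drained queue back so the removals are persisted in state.
            queueState.update(queue);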
        }
    }

What I am trying to do now is parallelize it. My current idea is to do the following:

    events.keyBy((Event event) -> event.id)
                    .rebalance()
                    .process(new SortFunction()).setParallelism(3)
                    .map(new KWayMerge()).setParallelism(1)
                    .print();

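If I understand correctly (and please correct me if I am wrong), what should happen in this case is that a share of the events for a given key (ideally 1/3) goes to each of the parallel instances of SortFunction. To get a complete sort, I would then need a map, or another ProcessFunction, that receives the sorted events from the 3 different instances and merges them back together.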

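If that is the case, is there any way to distinguish the origin of each event the map receives, so that I can perform a 3-way merge in the map? If that is not possible, my next idea is to swap the PriorityQueue for a TreeMap and put everything into a window, so that the merge happens at the end of the window once all 3 TreeMaps have been received. Does this second option make sense if the first one is not viable, or is there a better way to do something like this?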

Answer 1

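First of all, you should be aware that using a PriorityQueue or a TreeMap in Flink ValueState is only a reasonable idea if you are using a heap-based state backend. With RocksDB this will perform quite badly, because the PriorityQueue is deserialized on every access and reserialized on every update. In general we recommend sorting based on MapState, and this is how sorting is implemented in Flink's libraries.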
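To illustrate the MapState idea, here is a minimal plain-Java sketch with no Flink dependency. It is an assumption about how such a sorter could be laid out, not Flink's actual library code: a `HashMap` stands in for `MapState<Long, List<Event>>` keyed by timestamp, a `TreeSet` stands in for the event-time timers (which fire in timestamp order), and the class and method names are made up for the example. Because each timer only touches the map entry for its own timestamp, no large collection has to be read and rewritten on every event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical plain-Java stand-in for a per-key, MapState-style sorter. */
final class MapStateSortSketch {
    // Stands in for MapState<Long, List<String>>: timestamp -> events with that timestamp.
    private final Map<Long, List<String>> buffer = new HashMap<>();
    // Stands in for registered event-time timers, which fire in timestamp order.
    private final TreeSet<Long> timers = new TreeSet<>();
    private long watermark = Long.MIN_VALUE;

    /** Analogue of processElement: buffer the event and register a timer for it. */
    void processElement(long timestamp, String event) {
        if (timestamp <= watermark) {
            return; // late event: dropped, as in the question's SortFunction
        }
        buffer.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(event);
        timers.add(timestamp);
    }

    /** Analogue of a watermark arriving: fire all due timers in timestamp order. */
    List<String> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<String> out = new ArrayList<>();
        while (!timers.isEmpty() && timers.first() <= watermark) {
            long ts = timers.pollFirst();
            // Each timer reads and clears only the entry for its own timestamp.
            out.addAll(buffer.remove(ts));
        }
        return out;
    }

    public static void main(String[] args) {
        MapStateSortSketch sorter = new MapStateSortSketch();
        sorter.processElement(30L, "c");
        sorter.processElement(10L, "a");
        sorter.processElement(20L, "b");
        System.out.println(sorter.advanceWatermark(25L)); // events with timestamp <= 25
        System.out.println(sorter.advanceWatermark(40L)); // the remaining event
    }
}
```

In a real KeyedProcessFunction the same shape would apply per key: `processElement` appends to the map entry for the event's timestamp and registers a timer, and `onTimer` emits and removes just that entry.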

What this code will do

events.keyBy((Event event) -> event.id)
            .process(new SortFunction())

is sort the stream independently, key by key: the output will be sorted with respect to each key, but not globally.

On the other hand, this

events.keyBy((Event event) -> event.id)
                .rebalance()
                .process(new SortFunction()).setParallelism(3)

won't work, because the result of the rebalance is no longer a KeyedStream, and the SortFunction depends on keyed state.

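Moreover, I don't believe that doing 3 sorts of 1/3 of the stream each and then merging the results will perform noticeably better than a single global sort. If you need a global sort, you might want to consider using the Table API instead. See the answer here for an example.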
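To make concrete what the merge step would have to do, here is a minimal k-way merge sketch in plain Java. It is only an illustration under simplifying assumptions: the inputs are bounded, already-sorted lists of timestamps, and the merger knows which input each element came from. Neither assumption holds for an unbounded stream feeding a single map operator, and the class name `KWayMergeSketch` is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical k-way merge over bounded, already-sorted inputs. */
final class KWayMergeSketch {
    /** Merges sorted lists of timestamps into one sorted list. */
    static List<Long> merge(List<List<Long>> sortedInputs) {
        // Heap entries are {value, input index, position within that input}.
        PriorityQueue<long[]> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int i = 0; i < sortedInputs.size(); i++) {
            if (!sortedInputs.get(i).isEmpty()) {
                heap.add(new long[] {sortedInputs.get(i).get(0), i, 0});
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] head = heap.poll();
            out.add(head[0]);
            int input = (int) head[1];
            int next = (int) head[2] + 1;
            // Refill the heap from the same input the emitted element came from.
            if (next < sortedInputs.get(input).size()) {
                heap.add(new long[] {sortedInputs.get(input).get(next), input, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Long>> inputs = Arrays.asList(
                Arrays.asList(1L, 4L, 7L),
                Arrays.asList(2L, 5L, 8L),
                Arrays.asList(3L, 6L, 9L));
        System.out.println(merge(inputs));
    }
}
```

Note that every element still passes through a single merging step, which is part of why splitting the sort three ways may not buy much over one global sort.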
