Question

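I have a Dataflow pipeline that reads from a Pub/Sub topic, applies a transformation, and writes to Bigtable. I want the elements read from Pub/Sub to be processed in order of their sequence number.

I am using a 2-minute fixed window and applying GroupByKey to it. After the GBK, I use the SortValues transform, which sorts the Iterable by SequenceNumber. I have observed that the GroupByKey step takes a long time, because all elements in the window are processed on the same worker. Is there an efficient way to sort elements within a fixed window?

Here is my pipeline code: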

// Read messages (with attributes) from the Pub/Sub topic.
PCollection<PubsubMessage> pubsubRecords = p.apply(PubsubIO.readMessagesWithAttributes()
        .fromTopic(StaticValueProvider.of(topic)));

// Parse each message into (sequence number, payload) and assign 2-minute fixed windows.
PCollection<KV<BigInteger, JSONObject>> window = pubsubRecords
        .apply("Raw to String", ParDo.of(new LogsFn()))
        .apply("Window", Window
                .<KV<BigInteger, JSONObject>>into(FixedWindows.of(Duration.standardMinutes(2)))
                .triggering(Repeatedly
                        .forever(AfterProcessingTime
                                .pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(2))))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());

// Give every element the same constant key, so the whole window ends up in one group.
PCollection<KV<String, KV<BigInteger, JSONObject>>> keyedWindow = window
        .apply(WithKeys.of(new SerializableFunction<KV<BigInteger, JSONObject>, String>() {
            @Override
            public String apply(KV<BigInteger, JSONObject> row) {
                return "key";
            }
        }));

// Group by the constant key, then sort each group's values by the sequence number.
PCollection<KV<String, Iterable<KV<BigInteger, JSONObject>>>> groupedWindow = keyedWindow
        .apply(GroupByKey.<String, KV<BigInteger, JSONObject>>create())
        .apply(SortValues.<String, BigInteger, JSONObject>create(BufferedExternalSorter.options()));

Answer 1

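I think your approach is correct. It is unavoidable that all elements must be sorted on the same worker. Ordered processing creates dependencies between data elements, and it is generally a poor fit for distributed computing.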
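To illustrate why the sort ends up on a single worker, here is a minimal plain-Java sketch (no Beam dependency) of what has to happen to one window's elements once they are grouped: the whole group is held in one place and ordered by sequence number. The class `WindowSortSketch` and the helper `sortBySequence` are hypothetical names made up for this illustration, and the `String` payloads stand in for the `JSONObject` values in the question. It is not how `SortValues` is implemented internally.

```java
import java.math.BigInteger;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WindowSortSketch {

    // Hypothetical helper: copies one window's elements into a list and
    // orders them by their BigInteger sequence number (ascending).
    static <V> List<Map.Entry<BigInteger, V>> sortBySequence(
            Iterable<Map.Entry<BigInteger, V>> windowElements) {
        List<Map.Entry<BigInteger, V>> sorted = new ArrayList<>();
        for (Map.Entry<BigInteger, V> element : windowElements) {
            sorted.add(element);
        }
        // A total order needs every element of the group, which is why
        // this step cannot be split across workers without a merge.
        sorted.sort(Map.Entry.comparingByKey());
        return sorted;
    }

    public static void main(String[] args) {
        // Elements of one window, arriving out of order.
        List<Map.Entry<BigInteger, String>> window = new ArrayList<>();
        window.add(new SimpleEntry<>(BigInteger.valueOf(3), "third"));
        window.add(new SimpleEntry<>(BigInteger.valueOf(1), "first"));
        window.add(new SimpleEntry<>(BigInteger.valueOf(2), "second"));

        for (Map.Entry<BigInteger, String> e : sortBySequence(window)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the comparison needs to see every element of the window, the cost grows with the window size. A shorter window would reduce the amount of data sorted in one place, at the price of ordering guarantees that only hold within each smaller window.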

Sorting elements within a fixed window - Cloud Dataflow
