Question

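I'm currently using Apache Beam with Google Dataflow to process real-time data. The data comes from Google PubSub, which is unbounded, so I'm currently using a streaming pipeline. However, it turns out that keeping a streaming pipeline running 24/7 is quite expensive. To reduce costs, I'm thinking of switching to a batch pipeline that runs at a fixed interval (e.g. every 30 minutes), since it doesn't matter to the users whether the processing is real-time.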

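Is it possible to use a PubSub subscription as a bounded source? My idea is that each time the job runs, it accumulates data for 1 minute before triggering. So far this doesn't seem possible, but I came across a class called BoundedReadFromUnboundedSource (which I have no idea how to use), so maybe there is a way?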

Below is roughly what the source looks like:

PCollection<MyData> data = pipeline
            .apply("ReadData", PubsubIO
                    .readMessagesWithAttributes()
                    .fromSubscription(options.getInput()))
            .apply("ParseData", ParDo.of(new ParseMyDataFn()))
            .apply("Window", Window
                    .<MyData>into(new GlobalWindows())
                    .triggering(Repeatedly
                            .forever(AfterProcessingTime
                                    .pastFirstElementInPane()
                                    .plusDelayOf(Duration.standardSeconds(5))
                            )
                    )
                    .withAllowedLateness(Duration.ZERO).discardingFiredPanes()
            );

I tried the following, but the job still runs in streaming mode:

PCollection<MyData> data = pipeline
            .apply("ReadData", PubsubIO
                    .readMessagesWithAttributes()
                    .fromSubscription(options.getInput()))
            .apply("ParseData", ParDo.of(new ParseMyDataFn()))

            // Is there a way to make the window trigger once and turn this into a bounded source?
            .apply("Window", Window
                    .<MyData>into(new GlobalWindows())
                    .triggering(AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))
                    )
                    .withAllowedLateness(Duration.ZERO).discardingFiredPanes()
            );

Answer 1

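PubsubIO does not explicitly support this at the moment, but you could try starting a streaming job periodically and programmatically calling Drain on it after a few minutes.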
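
As a rough sketch of the "programmatically call Drain" step: as far as I know, a drain can be requested through the Dataflow REST API by updating the job with requestedState set to JOB_STATE_DRAINED. The class name, the placeholder project/region/job values, and the way the access token is obtained are all assumptions for illustration; check the endpoint and body against the current Dataflow API reference before relying on this.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DrainRequestSketch {
    // Assumed base URL of the Dataflow v1b3 REST API.
    static final String DATAFLOW_API = "https://dataflow.googleapis.com/v1b3";

    // Builds the URI of a regional Dataflow job resource.
    static URI jobUri(String projectId, String region, String jobId) {
        return URI.create(String.format("%s/projects/%s/locations/%s/jobs/%s",
                DATAFLOW_API, projectId, region, jobId));
    }

    // Builds (but does not send) the job update request that asks for a drain.
    // accessToken is assumed to be a valid OAuth2 token obtained elsewhere.
    static HttpRequest drainRequest(String projectId, String region,
                                    String jobId, String accessToken) {
        String body = "{\"requestedState\":\"JOB_STATE_DRAINED\"}";
        return HttpRequest.newBuilder(jobUri(projectId, region, jobId))
                .header("Authorization", "Bearer " + accessToken)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical placeholder values, not taken from the question.
        HttpRequest request =
                drainRequest("my-project", "us-central1", "my-job-id", "TOKEN");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

A scheduler (cron, Cloud Scheduler, or similar) would launch the streaming job, wait a few minutes, and then send this request with java.net.http.HttpClient. Draining stops reading from the subscription and lets in-flight data finish processing, so unread messages remain in PubSub for the next run.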

Apache Beam: Batch pipeline with an unbounded source
