Question

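A Dataflow job reads from PubSub (an unbounded source) with a configured window and then writes files to Cloud Storage. I see multiple files being written. How can I limit this to a smaller number? I also tried .withNumShards() when writing, but multiple files are still created. Why does this happen, and how can I restrict the output to the configured number of files or fewer? To describe my use case: I run the job once a day and then stop it. The configured window duration is 8 hours. Even so, the job still produces multiple files per day (possibly 20 or more).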

Sample code snippet:

Pipeline pipeline = Pipeline.create(options);
PCollection<PubsubMessage> events =
        pipeline.apply(PubsubIO.readMessages().fromSubscription(options.getInputSubscription()))
                // Windowing
                .apply(options.getWindowDuration() + " Window",
                        Window.into(FixedWindows.of(DurationUtils.parseDuration("8h"))));

// Conversion from PubSub message payload to String using a ParDo
PCollection<String> strMsg = events.apply("To String", ParDo.of(new Extractor()));

// Windowed writes
strMsg.apply("Write File(s)",
        TextIO.write().withWindowedWrites()
                .to(new WindowedFilenamePolicy(options.getOutputDirectory(), options.getOutputFilenamePrefix(),
                        options.getOutputShardTemplate(), options.getOutputFilenameSuffix()))
                .withTempDirectory(NestedValueProvider.of(options.getOutputDirectory(),
                        (SerializableFunction<String, ResourceId>) input -> FileBasedSink
                                .convertToFileResourceIfPossible(input)))
                .withNumShards(options.getNumShards()));

return pipeline.run();
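
One way to reason about the file count: with windowed writes, output is produced per window (and per pane), and withNumShards() applies to each of those rather than to the job as a whole. The sketch below is plain Java, not Beam code. It assumes fixed windows aligned to the epoch and uses hypothetical timestamps and a hypothetical shard count of 5 to estimate an upper bound on the number of files. The class and method names (WindowFileEstimate, windowStart, distinctWindows, maxFiles) are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class WindowFileEstimate {

    // Start of the fixed window containing the timestamp,
    // assuming windows are aligned to the epoch (offset 0).
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMs = size.toMillis();
        long ts = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // Number of distinct fixed windows that the given timestamps fall into.
    static int distinctWindows(List<Instant> timestamps, Duration size) {
        Set<Instant> starts = new TreeSet<>();
        for (Instant t : timestamps) {
            starts.add(windowStart(t, size));
        }
        return starts.size();
    }

    // Upper bound on files: one set of shards per window per pane.
    static int maxFiles(int windows, int panesPerWindow, int numShards) {
        return windows * panesPerWindow * numShards;
    }

    public static void main(String[] args) {
        // Hypothetical message timestamps from roughly one day of backlog.
        List<Instant> timestamps = List.of(
                Instant.parse("2024-01-01T01:00:00Z"),
                Instant.parse("2024-01-01T07:59:59Z"),
                Instant.parse("2024-01-01T08:00:00Z"),
                Instant.parse("2024-01-01T17:30:00Z"),
                Instant.parse("2024-01-02T00:30:00Z"));

        int windows = distinctWindows(timestamps, Duration.ofHours(8));
        // Assumed values: one pane per window, numShards = 5.
        int files = maxFiles(windows, 1, 5);

        System.out.println("windows = " + windows);
        System.out.println("max files = " + files);
    }
}
```

If this reasoning applies to the job above, a day of backlog spanning several 8-hour windows would yield several groups of shards even with a small shard count. Late data or extra trigger firings would add further panes and therefore further files.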

Dataflow job is creating multiple files after reading from an unbounded input

0 answers: