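My Dataflow job reads from PubSub (an unbounded source) over a configured window and then writes files to Cloud Storage. I am seeing multiple files being written. How can I limit this to a smaller number? I also tried .withNumShards() on the write, but multiple files are still created. Why does this happen, and how can I restrict the output to the configured number of files or fewer? To describe my use case: I run the job once a day and then stop it. The configured window duration is 8 hours. Even so, the job still produces many files per day (possibly 20 or more).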
Sample code snippet:
    Pipeline pipeline = Pipeline.create(options);

    PCollection<PubsubMessage> events =
        pipeline
            .apply(PubsubIO.readMessages().fromSubscription(options.getInputSubscription()))
            // Windowing
            .apply(options.getWindowDuration() + " Window",
                Window.into(FixedWindows.of(DurationUtils.parseDuration("8h"))));

    // Conversion from PubSub message payload to String using a ParDo
    PCollection<String> strMsg = events.apply("To String", ParDo.of(new Extractor()));

    // Windowed writes
    strMsg.apply("Write File(s)",
        TextIO.write()
            .withWindowedWrites()
            .to(new WindowedFilenamePolicy(
                options.getOutputDirectory(),
                options.getOutputFilenamePrefix(),
                options.getOutputShardTemplate(),
                options.getOutputFilenameSuffix()))
            .withTempDirectory(NestedValueProvider.of(
                options.getOutputDirectory(),
                (SerializableFunction<String, ResourceId>) input ->
                    FileBasedSink.convertToFileResourceIfPossible(input)))
            .withNumShards(options.getNumShards()));

    return pipeline.run();
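
To make the expectation concrete, here is a rough sketch of how many files I would expect from windowed writes. It is plain Java, not Beam code, and the class and method names (WindowFileEstimate, windowStart, countWindows, estimateFiles) are made up for illustration. It assumes epoch-aligned fixed windows with no offset, one output pane per window (default trigger, no late firings), and a positive numShards that is actually honored. If getNumShards() resolves to 0, my understanding is that the shard count is left to the runner, so this estimate would not apply.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.TreeSet;

public class WindowFileEstimate {

    // Start of the epoch-aligned fixed window that contains ts
    // (assumption: zero window offset).
    public static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        long tsMs = ts.toEpochMilli();
        return Instant.ofEpochMilli(tsMs - Math.floorMod(tsMs, sizeMs));
    }

    // Number of distinct fixed windows touched by the given event timestamps.
    public static int countWindows(List<Instant> timestamps, Duration size) {
        TreeSet<Instant> starts = new TreeSet<>();
        for (Instant ts : timestamps) {
            starts.add(windowStart(ts, size));
        }
        return starts.size();
    }

    // Rough estimate: one set of numShards files per window
    // (assumption: a single pane per window).
    public static int estimateFiles(List<Instant> timestamps, Duration size, int numShards) {
        return countWindows(timestamps, size) * numShards;
    }

    public static void main(String[] args) {
        // Hypothetical message timestamps spread over one day.
        List<Instant> ts = List.of(
            Instant.parse("2020-01-01T01:00:00Z"),
            Instant.parse("2020-01-01T09:30:00Z"),
            Instant.parse("2020-01-01T17:45:00Z"),
            Instant.parse("2020-01-01T23:59:00Z"));

        System.out.println(estimateFiles(ts, Duration.ofHours(8), 1)); // prints 3
        System.out.println(estimateFiles(ts, Duration.ofHours(8), 5)); // prints 15
    }
}
```

By this reasoning, a day of messages with 8-hour windows should give about 3 files per shard, so 20 or more files suggests either a larger effective shard count or more than one pane being written per window.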