Writing an unbounded collection to GCS

Asked: 2017-08-04 18:46:51

Tags: google-cloud-dataflow apache-beam apache-beam-io

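I have seen many questions on this same topic, but I still cannot get writes to GCS working. I am reading from a Pub/Sub topic and trying to push the messages to GCS. I have referred to this link, but I cannot find IOChannelUtils in the latest Beam package.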

PCollection<String> details = pipeline
    .apply(PubsubIO.readStrings().fromTopic("/topics/<project>/sampleTopic"));

PCollection<KV<String, String>> keyedStream = details.apply(
    WithKeys.of(new SerializableFunction<String, String>() {
        public String apply(String s) {
            return "constant";
        }
    }));

PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream
    .apply(Window.<KV<String, String>>into(FixedWindows.of(ONE_MIN))
        .withAllowedLateness(ONE_DAY)
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(10))
            .withLateFirings(AfterFirst.of(
                AfterPane.elementCountAtLeast(10),
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_SECONDS))))
        .discardingFiredPanes())
    .apply(GroupByKey.create());

PCollection<Iterable<String>> windows = keyedWindows.apply(Values.create());
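
For readers unsure what FixedWindows.of(ONE_MIN) does to each element, here is a minimal plain-Java sketch of the bucketing arithmetic. It is not Beam code: the class and method names (FixedWindowSketch, windowStart, windowEnd) are hypothetical, and it assumes windows aligned to the epoch with no offset, covering the half-open range [start, end).

```java
import java.time.Duration;
import java.time.Instant;

/** Plain-Java illustration of fixed-window bucketing; hypothetical names, not Beam API. */
final class FixedWindowSketch {

    /** Start of the fixed window containing the timestamp (assumes epoch-aligned windows). */
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = timestamp.toEpochMilli();
        // floorMod keeps the result non-negative even for pre-epoch timestamps.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    /** Exclusive end of the same window: start plus the window size. */
    static Instant windowEnd(Instant timestamp, Duration size) {
        return windowStart(timestamp, size).plus(size);
    }
}
```

With a one-minute size, an element stamped 2017-08-04T18:46:51Z falls in the window from 18:46:00Z up to, but not including, 18:47:00Z.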

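This is what I pieced together from many similar questions on Stack Overflow. As far as I understand, TextIO does support writing unbounded PCollections through the withWindowedWrites and withNumShards options.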

Reference: Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

However, I don't understand how I should actually do it.

I am trying to write to GCS as follows.

FilenamePolicy policy = DefaultFilenamePolicy.constructUsingStandardParameters(
    StaticValueProvider.of(outputDirectory),
    DefaultFilenamePolicy.DEFAULT_SHARD_TEMPLATE,
    "");

details.apply(TextIO.write()
    .to("gs://<bucket>/topicfile")
    .withWindowedWrites()
    .withFilenamePolicy(policy)
    .withNumShards(4));

I don't have enough reputation on Stack Overflow to comment on those other questions, so I am asking this as a separate question.

2 Answers:

Answer 0 (score: 3)

Take a look at this Pub/Sub to GCS Pipeline, which provides a complete example of writing windowed files to GCS.

Answer 1 (score: 2)

I was able to solve this by changing the windowing as shown below.

PCollection<String> streamedDataWindows = streamedData.apply(
    Window.<String>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes());

streamedDataWindows.apply(TextIO.write()
    .to(CLOUD_STORAGE)
    .withWindowedWrites()
    .withNumShards(1)
    .withFilenamePolicy(new PerWindowFiles()));

public static class PerWindowFiles extends FileBasedSink.FilenamePolicy {

    public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
        // Override the file name creation here.
    }

}
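
The body of windowedFilename is left open above. As a rough, plain-Java sketch of the kind of name such a policy might build, the helper below combines a prefix, the window bounds, a pane index, and the shard numbers. Everything here is hypothetical (the class WindowedFilenameSketch, the method filenameFor, and the name layout); it does not call Beam, and it assumes those values can be read from the windowed context. The pane index is included on the assumption that, with a single global window firing repeatedly, the window bounds alone would not distinguish one output file from the next.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper that formats a per-window, per-pane, per-shard file name. */
final class WindowedFilenameSketch {

    // Compact UTC timestamp, e.g. 20170804T184600.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss", Locale.ROOT).withZone(ZoneOffset.UTC);

    static String filenameFor(String prefix, Instant windowStart, Instant windowEnd,
                              long paneIndex, int shardNumber, int numShards, String extension) {
        return String.format(Locale.ROOT, "%s-%s-%s-pane-%d-%05d-of-%05d%s",
            prefix,
            FORMAT.format(windowStart),
            FORMAT.format(windowEnd),
            paneIndex,
            shardNumber,
            numShards,
            extension);
    }
}
```

For example, prefix "topicfile", a window from 18:46:00Z to 18:47:00Z on 2017-08-04, pane 0, shard 0 of 4, and extension ".txt" give topicfile-20170804T184600-20170804T184700-pane-0-00000-of-00004.txt. In a real policy the resulting string would still have to be resolved against the output directory to produce a ResourceId.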

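Although this solved the problem for me, I am still not sure I understand the windowing concept here. I will add more details as I figure it out. If anyone understands this better, please add more details. Thanks.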
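
On the windowing question, one way to picture the trigger used in this answer (Repeatedly.forever over AfterProcessingTime.pastFirstElementInPane().plusDelayOf(30 seconds), with discardingFiredPanes) is the small plain-Java simulation below. It is only a sketch and not Beam code: the class and method names are hypothetical, arrival times are given in whole seconds of processing time and must be sorted, and an element arriving exactly at the firing instant is assumed to start the next pane.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates panes produced by a repeated "first element + delay" processing-time trigger. */
final class ProcessingTimeTriggerSketch {

    /**
     * Groups sorted arrival times into panes. A pane opens when its first element arrives,
     * fires delaySeconds later, and (discarding mode) the next pane starts empty.
     */
    static List<List<Long>> panes(List<Long> arrivalSeconds, long delaySeconds) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long fireAt = 0;
        for (long t : arrivalSeconds) {
            if (current == null || t >= fireAt) {
                // The previous pane has already fired; this element opens a new one.
                current = new ArrayList<>();
                result.add(current);
                fireAt = t + delaySeconds;
            }
            current.add(t);
        }
        return result;
    }
}
```

With arrivals at 0, 10, 29, 31, 45 and 100 seconds and a 30 second delay, the simulation yields three panes: [0, 10, 29], [31, 45] and [100]. Each pane would then be handed to the write step as its own batch, which is why the file name needs something that differs between panes.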