Writing from PubSub to Google Cloud Storage with Cloud Dataflow using windowing

Time: 2016-11-06 11:31:12

Tags: google-cloud-dataflow

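I am receiving messages through PubSub in streaming mode (my use case requires this). Each message should be stored in its own file in GCS. Since TextIO.Write does not support unbounded collections, I tried to split the PCollection into windows that each contain a single element, and to write each window to Google Cloud Storage.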

Here is my code:

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.create()
                .as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(PROJECT_ID);
        options.setStagingLocation(STAGING_LOCATION);
        options.setStreaming(true);
        Pipeline pipeline = Pipeline.create(options);

        PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
                .subscription(SUBSCRIPTION);

        PCollection<String> streamData = pipeline.apply(readFromPubsub);

        PCollection<String> windowedMessage = streamData.apply(
                Window.<String>triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                        .discardingFiredPanes());

        windowedMessage.apply(TextIO.Write.to("gs://pubsub-outputs/1"));

        pipeline.run();
    }

I still get the same error that I was getting before I added the window:

The DataflowPipelineRunner in streaming mode does not support TextIO.Write.

What code would accomplish the above?

1 Answer:

Answer 0: (score: 2)

TextIO only works with bounded PCollections. Instead, you can write to GCS using the Storage API.

You can do something like this:

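    PipeOptions options = data.getPipeline().getOptions().as(PipeOptions.class);
    // Give every element the same key so GroupByKey collects one Iterable per window.
    data.apply(WithKeys.of(new SerializableFunction<String, String>() {
            public String apply(String s) { return "mykey"; }
        }))
        // Split the unbounded stream into fixed windows of getTimeWrite() minutes.
        .apply(Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardMinutes(options.getTimeWrite()))))
        .apply(GroupByKey.<String, String>create())
        .apply(Values.<Iterable<String>>create())
        // StorageWrite receives the messages of one window and writes them to GCS.
        .apply(ParDo.of(new StorageWrite(options)));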
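To make the effect of the Window and GroupByKey steps concrete, here is a minimal plain-Java sketch that does not use the Dataflow SDK at all. It assumes fixed windows aligned to the epoch with no offset, and it groups timestamped messages by the start of the window they fall into. The class name `FixedWindowSketch` and its methods are made up for illustration; they are not part of any SDK.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only: simulates fixed windowing plus grouping.
final class FixedWindowSketch {

    // Start of the fixed window containing the timestamp (windows aligned to the epoch).
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long tsMillis = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(tsMillis - Math.floorMod(tsMillis, sizeMillis));
    }

    // Groups (timestamp, message) pairs by window start, preserving arrival order per window.
    static Map<Instant, List<String>> groupByWindow(
            List<Map.Entry<Instant, String>> messages, Duration size) {
        Map<Instant, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<Instant, String> message : messages) {
            grouped.computeIfAbsent(windowStart(message.getKey(), size), k -> new ArrayList<>())
                   .add(message.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<Instant, String>> messages = new ArrayList<>();
        messages.add(new SimpleImmutableEntry<>(Instant.parse("2016-11-06T11:31:12Z"), "a"));
        messages.add(new SimpleImmutableEntry<>(Instant.parse("2016-11-06T11:34:59Z"), "b"));
        messages.add(new SimpleImmutableEntry<>(Instant.parse("2016-11-06T11:35:00Z"), "c"));
        // Each map entry corresponds to one Iterable<String> handed to the writing step.
        System.out.println(groupByWindow(messages, Duration.ofMinutes(5)));
    }
}
```

Each group in the resulting map plays the role of the Iterable that the final ParDo receives: a finite batch of messages that can be written out as a single object.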

You create a window followed by a GroupByKey operation, which gives you an Iterable that you can write to Storage. The processElement of StorageWrite:

        // Assumes `storage` is a Storage client held by the DoFn. In Dataflow SDK 1.x the
        // DoFn must also implement DoFn.RequiresWindowAccess for c.window() to work.
        PipeOptions options = c.getPipelineOptions().as(PipeOptions.class);
        String date = ISODateTimeFormat.date().print(c.window().maxTimestamp());
        String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
        String blobName = String.format("%s/%s/%s", options.getBucketRepository(), date, options.getFileOutName() + isoDate);

        BlobId blobId = BlobId.of(options.getGCSBucket(), blobName);

        // try-with-resources closes the channel even if a write fails.
        try (WriteChannel writer = storage.writer(BlobInfo.builder(blobId).contentType("text/plain").build())) {
            // Messages are written back to back; add a separator here if you need one.
            for (String message : c.element()) {
                writer.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));
            }
        }
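
The object name built in processElement can be reproduced outside the pipeline to check what it looks like. The following is a rough sketch using java.time instead of Joda-Time, assuming UTC and assuming that a window's maximum timestamp is one millisecond before its end. `BlobNameSketch`, the repository name "messages" and the prefix "out-" are placeholders, not values from the original code.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only: mirrors the naming scheme used in processElement.
final class BlobNameSketch {

    // yyyy-MM-dd, intended to match Joda's ISODateTimeFormat.date() for a UTC instant.
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Full ISO date-time with milliseconds and offset, printed as "Z" for UTC.
    private static final DateTimeFormatter DATE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").withZone(ZoneOffset.UTC);

    // Assumption: the last timestamp inside a window is its end minus one millisecond.
    static Instant maxTimestamp(Instant windowStart, Duration size) {
        return windowStart.plus(size).minusMillis(1);
    }

    // Builds "<repository>/<date>/<prefix><iso date-time>", one name per window.
    static String blobName(String repository, String filePrefix, Instant maxTimestamp) {
        String date = DATE.format(maxTimestamp);
        String isoDate = DATE_TIME.format(maxTimestamp);
        return String.format("%s/%s/%s", repository, date, filePrefix + isoDate);
    }

    public static void main(String[] args) {
        Instant max = maxTimestamp(Instant.parse("2016-11-06T11:30:00Z"), Duration.ofMinutes(5));
        System.out.println(blobName("messages", "out-", max));
    }
}
```

Because the name depends only on the window, this scheme produces one object per window rather than one per message. If one object per message is required, as in the question, one possible approach would be to loop over the Iterable and write each element to its own object with a unique suffix in the name.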