Moving files to another GCS folder after an Apache Beam pipeline has processed them

Date: 2019-09-20 12:06:03

Tags: google-cloud-dataflow apache-beam dataflow apache-beam-io

I created a streaming Apache Beam pipeline that reads files from a GCS folder and inserts them into BigQuery. It works perfectly, but whenever I stop and re-run the job it reprocesses all the files, so all the data gets duplicated.

My idea is to move each file out of the scanned directory into another one once it has been processed, but I don't know how to do that with Apache Beam.

Thanks


public static PipelineResult run(Options options) {
        // Create the pipeline.
        Pipeline pipeline = Pipeline.create(options);

        /*
         * Steps:
         *  1) Read from the text source.
         *  2) Write each text record to Pub/Sub
         */

        LOG.info("Running pipeline");
        LOG.info("Input : " + options.getInputFilePattern());
        LOG.info("Output : " + options.getOutputTopic());

        PCollection<String> collection = pipeline
                .apply("Read Text Data", TextIO.read()
                        .from(options.getInputFilePattern())
                        .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))

                .apply("Write logs", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws Exception {
                        LOG.info(c.element());
                        c.output(c.element());
                    }
                }));

        collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

        return pipeline.run();
    }

1 Answer:

Answer 0 (score: 1)

A few tips:

  • Usually you are not expected to stop and re-run a streaming pipeline. Streaming pipelines are meant to run permanently; if you need to change the logic, you can sometimes update the running job in place instead.
  • That said, you can use FileIO to match the files and move each one once it has been processed.

You would write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, that reads the whole file and then moves it to a new bucket.
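A minimal sketch of such a DoFn, assuming GCS paths and Beam's `FileSystems.rename` for the move. The class name comes from the answer itself, but the archive destination `gs://my-archive-bucket` is a placeholder, not something from the question:

```java
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: read a whole matched file, emit its contents, then move the
// file out of the watched folder so it is not re-read on restart.
static class ReadWholeFileThenMoveToAnotherBucketDoFn
        extends DoFn<FileIO.ReadableFile, String> {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Read the entire file as one UTF-8 string and emit it downstream.
        c.output(file.readFullyAsUTF8String());

        // Move (rename) the file into an archive bucket. The bucket name
        // below is an assumed placeholder.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId dest = FileSystems.matchNewResource(
                "gs://my-archive-bucket/" + source.getFilename(),
                false /* isDirectory */);
        FileSystems.rename(
                Collections.singletonList(source),
                Collections.singletonList(dest));
    }
}
```

Note the caveat: the rename happens as soon as the element is processed, so if a downstream stage (e.g. the BigQuery write) later fails, the file has already been moved. If that matters, archive the files in a separate step after the write is known to have succeeded.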

    Pipeline pipeline = Pipeline.create(options);


    PCollection<MatchResult.Metadata> matches = pipeline
            .apply("Match Files", FileIO.match()
                    .filepattern(options.getInputFilePattern())
                    .continuously(Duration.standardSeconds(60),
                                  Watch.Growth.<String>never()));

    matches.apply(FileIO.readMatches())
           .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
            .apply("Write logs", ParDo.of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws Exception {
                    LOG.info(c.element());
                    c.output(c.element());
                }
            }));

    ....