I created a streaming Apache Beam pipeline that reads files from a GCS folder and inserts them into BigQuery. It works perfectly, but whenever I stop and rerun the job it reprocesses all the files, so all the data gets duplicated.
My idea was to move the files out of the scanned directory into another one once they are processed, but I don't know how to do that with Apache Beam.
Thanks
public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub.
     */
    LOG.info("Running pipeline");
    LOG.info("Input : " + options.getInputFilePattern());
    LOG.info("Output : " + options.getOutputTopic());

    PCollection<String> collection = pipeline
        .apply("Read Text Data", TextIO.read()
            .from(options.getInputFilePattern())
            .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

    collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
}
Answer 0 (score: 1)
A few tips:
Write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, that reads the whole file and then moves it to another bucket; a sketch of such a class follows the pipeline snippet below.
Pipeline pipeline = Pipeline.create(options);

// Note: FileIO.match() yields MatchResult.Metadata elements, not FileIO.Match.
PCollection<MatchResult.Metadata> matches = pipeline
    .apply("Read Text Data", FileIO.match()
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(60), Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
    .apply("Write logs", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            LOG.info(c.element());
            c.output(c.element());
        }
    }));
....
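The answer does not show ReadWholeFileThenMoveToAnotherBucketDoFn itself. Here is a minimal sketch of what it could look like; the class body and the destination bucket gs://processed-bucket/ are illustrative assumptions, not part of the original answer or the Beam API:

import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn: reads each matched file in full, emits its contents,
// then moves the file to another bucket so a restarted job will not see it.
class ReadWholeFileThenMoveToAnotherBucketDoFn extends DoFn<FileIO.ReadableFile, String> {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Emit the whole file as a single UTF-8 string.
        c.output(file.readFullyAsUTF8String());

        // Move the file out of the watched location. The destination
        // bucket name is an assumption; adjust it to your setup.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems.matchNewResource(
            "gs://processed-bucket/" + source.getFilename(), false /* isDirectory */);
        FileSystems.rename(
            Collections.singletonList(source),
            Collections.singletonList(destination));
    }
}

Keep in mind that a streaming runner can retry a bundle, so a file may occasionally be read again before the move completes; downstream writes should tolerate the odd duplicate.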