I created a streaming Apache Beam pipeline that reads files from a GCS folder and inserts them into BigQuery. It works perfectly, but whenever I stop and rerun the job it reprocesses all the files, so all the data gets duplicated.
My idea was to move the files out of the scanned directory into another one once they are processed, but I don't know how to do that with Apache Beam.
Thanks
public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub.
     */
    LOG.info("Running pipeline");
    LOG.info("Input : " + options.getInputFilePattern());
    LOG.info("Output : " + options.getOutputTopic());

    PCollection<String> collection = pipeline
        .apply("Read Text Data", TextIO.read()
            .from(options.getInputFilePattern())
            .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

    collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
}
Answer 0 (score: 1)
A few tips:
Write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, that reads the whole file and then moves it to another bucket; a sketch of such a class follows the pipeline snippet below.
Pipeline pipeline = Pipeline.create(options);

// Note: FileIO.match() yields MatchResult.Metadata elements, not FileIO.Match.
PCollection<MatchResult.Metadata> matches = pipeline
    .apply("Read Text Data", FileIO.match()
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(60), Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
    .apply("Write logs", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            LOG.info(c.element());
            c.output(c.element());
        }
    }));
....
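The answer does not show ReadWholeFileThenMoveToAnotherBucketDoFn itself. Here is a minimal sketch of what it could look like; the class body and the destination bucket gs://processed-bucket/ are illustrative assumptions, not part of the original answer or the Beam API:

import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn: reads each matched file in full, emits its contents,
// then moves the file to another bucket so a restarted job will not see it.
class ReadWholeFileThenMoveToAnotherBucketDoFn extends DoFn<FileIO.ReadableFile, String> {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Emit the whole file as a single UTF-8 string.
        c.output(file.readFullyAsUTF8String());

        // Move the file out of the watched location. The destination
        // bucket name is an assumption; adjust it to your setup.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems.matchNewResource(
            "gs://processed-bucket/" + source.getFilename(), false /* isDirectory */);
        FileSystems.rename(
            Collections.singletonList(source),
            Collections.singletonList(destination));
    }
}

Keep in mind that a streaming runner can retry a bundle, so a file may occasionally be read again before the move completes; downstream writes should tolerate the odd duplicate.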