Is there a way to write one file per record with Apache Beam FileIO?

Time: 2019-04-28 04:38:05

Tags: apache-beam apache-beam-io

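I am learning Apache Beam and trying to implement something similar to distcp. I use FileIO.read().filepattern() to get the input files, but when I write them out with FileIO.write, the files are sometimes merged into one output file.

The number of partitions cannot be known before the job runs.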

PCollection<MatchResult.Metadata> pCollection = pipeline.apply(this.name(), FileIO.match().filepattern(path()))
  .apply(FileIO.readMatches())
  .apply(name(), FileIO.<FileIO.ReadableFile>write()
        .via(FileSink.create())
        .to(path()));

Sink code:

@AutoValue
public abstract static class FileSink implements FileIO.Sink<FileIO.ReadableFile> {

    private OutputStream outputStream;

    public static FileSink create() {
      return new AutoValue_FileIOOperator_FileSink();
    }

    @Override
    public void open(WritableByteChannel channel) throws IOException {
      outputStream = Channels.newOutputStream(channel);
    }

    @Override
    public void write(FileIO.ReadableFile element) throws IOException {
      try (final InputStream inputStream = Channels.newInputStream(element.open())) {
        IOUtils.copy(inputStream, outputStream);
      }
    }

    @Override
    public void flush() throws IOException {
      outputStream.flush();
    }
  }

1 Answer:

Answer 0 (score: 1)

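You can use FileIO.writeDynamic and specify in .by how elements are assigned to output files. For example, if your elements have unique keys, you can use .by(KV::getKey) and each element will be written to a separate file. Otherwise, the criterion can be a hash of the row, and so on. You can also adjust .withNaming as you like. As a demonstration: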

p.apply("Create Data", Create.of(KV.of("one", "this is row 1"), KV.of("two", "this is row 2"), KV.of("three", "this is row 3"), KV.of("four", "this is row 4")))
 .apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to(output)
    .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

This writes the four elements to four files:

$ mvn compile -e exec:java \
 -Dexec.mainClass=com.dataflow.samples.OneRowOneFile \
      -Dexec.args="--project=$PROJECT \
      --output="output/" \
      --runner=DirectRunner"

$ ls output/
file-four-00001-of-00003.txt  file-one-00002-of-00003.txt  file-three-00002-of-00003.txt  file-two-00002-of-00003.txt
$ cat output/file-four-00001-of-00003.txt 
this is row 4
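
When the records have no natural unique key, the "hash of the row" idea mentioned above could look roughly like the following. This is only a sketch using the JDK: the class `RowKeys` and the method `keyFor` are made-up names for illustration and are not part of Beam, and SHA-256 is just one reasonable choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of Beam): derives a destination key from the row content.
final class RowKeys {
    private RowKeys() {}

    /** Returns the SHA-256 digest of the row as 64 lowercase hex characters. */
    static String keyFor(String row) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(row.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to 0..255 so negative bytes are not sign-extended.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Different rows produce different keys, so they map to different destinations.
        System.out.println(keyFor("this is row 1"));
        System.out.println(keyFor("this is row 2"));
    }
}
```

In the pipeline above, such a helper would presumably be plugged in as `.by(kv -> RowKeys.keyFor(kv.getValue()))`. Note that two identical rows hash to the same key and would therefore end up in the same file, so this only gives one file per record when the rows are distinct.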

Full code:

package com.dataflow.samples;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;


public abstract class OneRowOneFile {

    public interface Options extends PipelineOptions {
        @Validation.Required
        @Description("Output Path i.e. gs://BUCKET/path/to/output/folder")
        String getOutput();
        void setOutput(String s);
    }

    public static void main(String[] args) {

        OneRowOneFile.Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(OneRowOneFile.Options.class);

        Pipeline p = Pipeline.create(options);

        String output = options.getOutput();

        p.apply("Create Data", Create.of(KV.of("one", "this is row 1"), KV.of("two", "this is row 2"), KV.of("three", "this is row 3"), KV.of("four", "this is row 4")))
         .apply(FileIO.<String, KV<String, String>>writeDynamic()
            .by(KV::getKey)
            .withDestinationCoder(StringUtf8Coder.of())
            .via(Contextful.fn(KV::getValue), TextIO.sink())
            .to(output)
            .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

        p.run().waitUntilFinish();
    }
}

Let me know if this also works with your custom sink.
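
As background on why the files in the question get merged: a sink's write method is called once for every element that lands in the same output file, and the custom sink copies each element's bytes into the same stream. The following Beam-free sketch mimics that open/write/flush pattern with plain JDK channels. The class `ConcatenatingSinkDemo` and its methods are hypothetical stand-ins for illustration, not the Beam API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the question's FileSink, using only the JDK.
final class ConcatenatingSinkDemo {
    private OutputStream outputStream;

    /** Called once per output file. */
    void open(WritableByteChannel channel) {
        outputStream = Channels.newOutputStream(channel);
    }

    /** Called once per element routed to this output file. */
    void write(InputStream element) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = element.read(buffer)) != -1) {
            outputStream.write(buffer, 0, read);
        }
    }

    void flush() throws IOException {
        outputStream.flush();
    }

    /** Writes all given inputs through one sink instance and returns the resulting file content. */
    static String copyAll(String... inputs) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        ConcatenatingSinkDemo sink = new ConcatenatingSinkDemo();
        sink.open(Channels.newChannel(file));
        for (String input : inputs) {
            sink.write(new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)));
        }
        sink.flush();
        return new String(file.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Two inputs sharing one output file are concatenated.
        System.out.println(copyAll("file A\n", "file B\n"));
    }
}
```

Under this reading, the sink itself is fine for a distcp-style copy. What matters is that each input file is given its own destination, which is what the .by function in writeDynamic controls.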