How do I compress output files in the Dataflow Java SDK?

Time: 2016-12-28 00:31:17

Tags: google-cloud-dataflow

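My pipeline writes its output data files to GCS, and I want to compress those files. TextIO decompresses compressed files when reading, but it does not seem to compress files when writing. How can I compress the output files?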

2 answers:

Answer 0 (score: 1)

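TextIO only supports reading compressed files. It does not support writing compressed files. The documentation states:

  "TextIO does not currently support writing to compressed files."

More information:

https://cloud.google.com/dataflow/model/text-io#reading-from-compressed-text-files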

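Answer 1 (score: 1)

This is currently an open feature request for Dataflow, but the work has already been done in Beam. Once Dataflow 2.0 is released (it will be based on Beam), this should be officially supported.

That said, I have been able to write compressed GZIP files by extending the FileBasedSink class and building on Jeff Payne's work on this feature in Beam.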

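// Package names below assume the Dataflow Java SDK 1.x; adjust them for your SDK version.
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSink;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.util.MimeTypes;

import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class GZIPSink<T> extends FileBasedSink<T> {
    private final Coder<T> coder;

    GZIPSink(String baseOutputFilename, Coder<T> coder) {
        // Every output shard gets the "gz" extension.
        super(baseOutputFilename, ".gz");
        this.coder = coder;
    }

    @Override
    public FileBasedWriteOperation<T> createWriteOperation(PipelineOptions pipelineOptions) {
        return new GZIPWriteOperation<T>(this, coder);
    }

    static class GZIPWriteOperation<T> extends FileBasedSink.FileBasedWriteOperation<T> {
        private final Coder<T> coder;

        private GZIPWriteOperation(GZIPSink<T> sink, Coder<T> coder) {
            super(sink);
            this.coder = coder;
        }

        @Override
        public FileBasedWriter<T> createWriter(PipelineOptions pipelineOptions) throws Exception {
            return new GZIPBasedWriter<T>(this, coder);
        }
    }

    static class GZIPBasedWriter<T> extends FileBasedSink.FileBasedWriter<T> {
        private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);
        private final Coder<T> coder;
        private GZIPOutputStream out;

        public GZIPBasedWriter(FileBasedWriteOperation<T> writeOperation, Coder<T> coder) {
            super(writeOperation);
            // The shards contain compressed bytes, so mark them as binary rather than text.
            this.mimeType = MimeTypes.BINARY;
            this.coder = coder;
        }

        @Override
        protected void prepareWrite(WritableByteChannel channel) throws Exception {
            // Wrap the shard's channel in a GZIP stream (syncFlush = true) and raise the
            // compression level through the protected Deflater field "def".
            out = new GZIPOutputStream(Channels.newOutputStream(channel), true) {{
                def.setLevel(Deflater.BEST_COMPRESSION);
            }};
        }

        @Override
        public void write(T value) throws Exception {
            // One encoded record per line.
            coder.encode(value, out, Coder.Context.OUTER);
            out.write(NEWLINE);
        }

        @Override
        public void writeFooter() throws IOException {
            // Write the GZIP trailer without closing the underlying channel.
            out.finish();
        }
    }
}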

Then, to actually write the output:

aStringPCollection.apply(Write.to(new GZIPSink<String>("gs://path/sharded-filename", StringUtf8Coder.of())));
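The compression steps inside the writer above (wrap a channel in a GZIPOutputStream, write one record per line, call finish() to emit the trailer) can be tried locally without Dataflow. The following is a minimal sketch using only java.util.zip; the class GzipLinesDemo and its methods compressLines and decompressLines are hypothetical names made up for this illustration, and an in-memory byte array stands in for the GCS file channel.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it is not part of the Dataflow SDK.
class GzipLinesDemo {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    // Mirrors prepareWrite / write / writeFooter from the sink, against an in-memory channel.
    static byte[] compressLines(List<String> lines) throws IOException {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(target);

        // Same construction as in the sink: syncFlush = true, highest compression level.
        GZIPOutputStream out = new GZIPOutputStream(Channels.newOutputStream(channel), true) {{
            def.setLevel(Deflater.BEST_COMPRESSION);
        }};

        for (String line : lines) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
            out.write(NEWLINE);
        }

        // finish() writes the GZIP trailer but leaves the underlying channel open.
        out.finish();
        return target.toByteArray();
    }

    // Reads the compressed bytes back as lines to check the round trip.
    static List<String> decompressLines(byte[] gzipped) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> input = Arrays.asList("first record", "second record");
        byte[] gzipped = compressLines(input);
        System.out.println("round trip ok: " + decompressLines(gzipped).equals(input));
    }
}
```

Calling finish() rather than close() matters in the sink because the framework owns the channel and closes it itself; in this sketch the distinction is harmless since the target is a byte array.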