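My pipeline writes its output data files to GCS, and I would like to compress those files. TextIO decompresses compressed input files when reading, but as far as I can tell it does not compress files when writing. How can I compress the output files?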
Answer 0 (score: 1)
TextIO only supports reading compressed files. It does not currently support writing compressed files.
More information: https://cloud.google.com/dataflow/model/text-io#reading-from-compressed-text-files
Answer 1 (score: 1)
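This is currently an open feature request for Dataflow, but the work has already been done in Beam. Once Dataflow 2.0 is released (it will be based on Beam), this should be officially supported.
That said, I have been able to write GZIP-compressed files by extending the FileBasedSink class and building on Jeff Payne's work on this feature in Beam.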
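    // Imports assume the Dataflow SDK for Java 1.x package layout.
    import com.google.cloud.dataflow.sdk.coders.Coder;
    import com.google.cloud.dataflow.sdk.io.FileBasedSink;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.util.MimeTypes;
    import java.io.IOException;
    import java.nio.channels.Channels;
    import java.nio.channels.WritableByteChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.GZIPOutputStream;

    public class GZIPSink<T> extends FileBasedSink<T> {
        private final Coder<T> coder;

        GZIPSink(String baseOutputFilename, Coder<T> coder) {
            // Every output shard gets a ".gz" extension.
            super(baseOutputFilename, ".gz");
            this.coder = coder;
        }

        @Override
        public FileBasedWriteOperation<T> createWriteOperation(PipelineOptions pipelineOptions) {
            return new GZIPWriteOperation<>(this, coder);
        }

        static class GZIPWriteOperation<T> extends FileBasedSink.FileBasedWriteOperation<T> {
            private final Coder<T> coder;

            private GZIPWriteOperation(GZIPSink<T> sink, Coder<T> coder) {
                super(sink);
                this.coder = coder;
            }

            @Override
            public FileBasedWriter<T> createWriter(PipelineOptions pipelineOptions) throws Exception {
                return new GZIPBasedWriter<>(this, coder);
            }
        }

        static class GZIPBasedWriter<T> extends FileBasedSink.FileBasedWriter<T> {
            private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);
            private final Coder<T> coder;
            private GZIPOutputStream out;

            public GZIPBasedWriter(FileBasedWriteOperation<T> writeOperation, Coder<T> coder) {
                super(writeOperation);
                this.mimeType = MimeTypes.BINARY;
                this.coder = coder;
            }

            @Override
            protected void prepareWrite(WritableByteChannel channel) throws Exception {
                // Wrap the shard's channel in a GZIP stream (syncFlush enabled). The anonymous
                // subclass is only there to reach the protected Deflater and raise its level.
                out = new GZIPOutputStream(Channels.newOutputStream(channel), true) {{
                    def.setLevel(Deflater.BEST_COMPRESSION);
                }};
            }

            @Override
            public void write(T value) throws Exception {
                // One encoded element per line.
                coder.encode(value, out, Coder.Context.OUTER);
                out.write(NEWLINE);
            }

            @Override
            public void writeFooter() throws IOException {
                // Write the GZIP trailer without closing the underlying channel.
                out.finish();
            }
        }
    }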
Then, to actually write the output:
    aStringPCollection.apply(Write.to(new GZIPSink<String>("gs://path/sharded-filename", StringUtf8Coder.of())));
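
To check what the writer does to the byte stream without running a pipeline, the same GZIPOutputStream setup can be tried with the JDK alone. This is only a minimal sketch under some assumptions: `GzipLinesDemo`, `BestCompressionGzipStream`, `gzipLines` and `gunzipLines` are hypothetical names that are not part of the Dataflow SDK, plain UTF-8 encoding stands in for the `Coder`, and an in-memory `ByteArrayOutputStream` stands in for the GCS channel.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLinesDemo {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    /** Hypothetical helper: a named alternative to the anonymous subclass in the sink. */
    static final class BestCompressionGzipStream extends GZIPOutputStream {
        BestCompressionGzipStream(OutputStream target) throws IOException {
            super(target, true); // syncFlush = true, as in prepareWrite()
            def.setLevel(Deflater.BEST_COMPRESSION);
        }
    }

    /** Writes each record as one UTF-8 line into a GZIP stream held in memory. */
    public static byte[] gzipLines(List<String> records) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            GZIPOutputStream out = new BestCompressionGzipStream(sink);
            for (String record : records) {
                out.write(record.getBytes(StandardCharsets.UTF_8));
                out.write(NEWLINE);
            }
            out.finish(); // writes the GZIP trailer but does not close 'sink'
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses the bytes and splits them back into lines. */
    public static List<String> gunzipLines(byte[] compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("alpha", "beta", "gamma");
        byte[] compressed = gzipLines(records);
        // The round trip should give back the original records.
        System.out.println(gunzipLines(compressed).equals(records));
    }
}
```

Calling `finish()` rather than `close()` mirrors `writeFooter()` in the sink: the GZIP trailer is flushed, and closing the underlying stream is left to whoever owns it. The same `gunzipLines` approach can be pointed at a downloaded shard (via a `FileInputStream`) to confirm that it decompresses cleanly.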