Zip compressed input for Apache Flink

Time: 2018-03-06 01:41:31

Tags: zip apache-flink

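I need to read and process a specific file inside a zip archive in Apache Flink.

In the documentation, I found:

"Flink currently supports transparent decompression of input files if these are marked with an appropriate file extension."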

https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/#read-compressed-files

Is it possible to process the file in Apache Flink while it is being decompressed on the fly?

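2 answers:

Answer 0 (score: 1)

FileInputFormat delegates reading the compressed file to GZIPInputStream, which returns the decompressed data piece by piece while decompression is still in progress.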
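To illustrate what "piece by piece" means here, the following is a minimal sketch that uses only the JDK and no Flink classes. The class and method names (GzipStreamingSketch, gzip, gunzip) are made up for this example. It compresses a string in memory and reads it back through GZIPInputStream in small chunks. Note that a gzip stream holds a single compressed file, whereas a zip archive can hold several entries, so this alone does not select one file from a zip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipStreamingSketch {

    // Compresses a byte array in memory so the example needs no file on disk.
    static byte[] gzip(final byte[] plain) throws IOException {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    // Reads the compressed stream in small chunks. Each read() call hands back
    // at most chunk.length decompressed bytes, so the whole content never has
    // to be decompressed before processing can start.
    static String gunzip(final InputStream compressed) throws IOException {
        final ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            final byte[] chunk = new byte[16];
            int n;
            while ((n = in.read(chunk)) != -1) {
                result.write(chunk, 0, n);
            }
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) throws IOException {
        final byte[] packed = gzip("line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(gunzip(new ByteArrayInputStream(packed)));
    }
}
```

The 16-byte buffer is deliberately tiny to make the chunked reads visible; a real reader would use a larger buffer.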

Answer 1 (score: 0)

I would like to share the solution I implemented in the meantime.

After creating my own InputFormat, I used the following code in its open() method:

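@Override
public void open(final FileInputSplit ignored) throws IOException {
    ...
    // StAX reader over the single XML file extracted from the zip archive.
    // Note: createXMLStreamReader(...) and hasNext() throw the checked
    // XMLStreamException, which has to be caught or wrapped here.
    final XMLInputFactory xmlif = XMLInputFactory.newInstance();
    final XMLStreamReader xmlr = xmlif.createXMLStreamReader(filePath.toString(),
              InputFormatUtil.readFileWithinZipArchive(filePath, nestedXmlFileName));
    while (xmlr.hasNext()) {
        ...
    }
}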

The implementation of readFileWithinZipArchive(...) is:

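public static InputStream readFileWithinZipArchive(final Path zipPath, final String filename) throws IOException {
    // using org.apache.flink.core.fs.Path for getting the InputStream from the (remote) zip archive
    final InputStream zipInputStream = zipPath.getFileSystem().open(zipPath);
    // generating a temporary local copy of the zip file (stream2file is a helper that is not shown here)
    final File tmpFile = stream2file(zipInputStream);
    // then using java.util.zip.ZipFile for extracting the InputStream for the specific file within the zip archive
    final ZipFile zipFile = new ZipFile(tmpFile);
    // getEntry returns null when the archive has no such entry, so fail with a clear message
    final ZipEntry entry = zipFile.getEntry(filename);
    if (entry == null) {
        zipFile.close();
        throw new FileNotFoundException(filename + " not found in " + zipPath);
    }
    // the ZipFile must stay open while the caller reads from the returned stream
    return zipFile.getInputStream(entry);
}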
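The stream2file helper is not shown in this answer, so the sketch below includes one possible version of it. This is an assumption about what it does, not the author's code. It also shows an alternative that may come closer to on-the-fly processing: java.util.zip.ZipInputStream can walk the archive entry by entry and stop at the wanted file, so no temporary copy is needed. The names ZipEntrySketch, openEntryStreaming, sampleZip and readAll are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntrySketch {

    // Hypothetical stand-in for the stream2file helper that is not shown above:
    // copies the stream into a temporary file that is deleted when the JVM exits.
    static File stream2file(final InputStream in) throws IOException {
        final File tmp = File.createTempFile("flink-zip-", ".zip");
        tmp.deleteOnExit();
        try (InputStream source = in) {
            Files.copy(source, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    // Alternative without a temporary file: scan the archive sequentially and
    // return the stream once it is positioned at the requested entry.
    // Reading from the returned stream ends at the end of that entry.
    static InputStream openEntryStreaming(final InputStream rawZip, final String filename) throws IOException {
        final ZipInputStream zin = new ZipInputStream(rawZip);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.getName().equals(filename)) {
                return zin;
            }
        }
        zin.close();
        throw new FileNotFoundException("No entry named " + filename + " in the archive");
    }

    // Builds a small two-entry archive in memory so the sketch runs without external files.
    static byte[] sampleZip() throws IOException {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry("other.txt"));
            out.write("ignored".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("data.xml"));
            out.write("<root/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Drains a stream into a UTF-8 string.
    static String readAll(final InputStream in) throws IOException {
        final ByteArrayOutputStream result = new ByteArrayOutputStream();
        final byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            result.write(chunk, 0, n);
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) throws IOException {
        try (InputStream entry = openEntryStreaming(new ByteArrayInputStream(sampleZip()), "data.xml")) {
            System.out.println(readAll(entry));
        }
    }
}
```

In an InputFormat, the stream returned by zipPath.getFileSystem().open(zipPath) could be passed to openEntryStreaming in place of the in-memory sample. The trade-off is that ZipInputStream has to read through every entry that precedes the wanted one, while ZipFile can jump straight to an entry but needs a local file.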