Question

What is the best way to use CombineFileInputFormat on gzip files?

Answer 1

This article will help you build your own InputFormat on top of CombineFileInputFormat to read and process gzip files. The sections below outline what needs to be done.

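Custom InputFormat:

A custom input format built on CombineFileInputFormat is almost identical to CombineFileInputFormat itself. The key must be our own Writable class that holds the file name and the offset; the value is the actual file content. Override isSplitable to return false (we do not want files to be split). Set the maximum split size (setMaxSplitSize) to suit your needs. Based on that value, CombineFileInputFormat decides the number of splits, and CombineFileRecordReader then creates an instance of your record reader for each file in a split. You have to build your own custom RecordReader and add the decompression logic to it.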
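
To get a feel for how the maximum split size drives the number of splits, here is a rough, self-contained model. It is not Hadoop's actual algorithm (which also groups blocks by node and rack); it only assumes that each unsplittable file is kept whole and that a split is closed once its accumulated size reaches the limit. The class name SplitPackingSketch and the file sizes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPackingSketch {

    /** Groups whole files into splits; a split is closed once it reaches maxSplitSize bytes. */
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            current.add(size);      // files are never cut, so the whole file is added
            currentSize += size;
            // A limit of 0 is treated as "no limit" in this model.
            if (maxSplitSize > 0 && currentSize >= maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);    // leftover files form the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] gzFiles = {40 * mb, 50 * mb, 30 * mb, 90 * mb, 10 * mb};  // hypothetical sizes
        System.out.println(pack(gzFiles, 64 * mb).size() + " splits with a 64 MB limit");
        System.out.println(pack(gzFiles, 256 * mb).size() + " splits with a 256 MB limit");
    }
}
```

A smaller limit yields more splits (and therefore more map tasks); a larger limit packs more gzip files into each task.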

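Custom RecordReader:

The custom RecordReader uses LineReader and sets the key to the file name and offset and the value to the actual file content. If the file is compressed, it decompresses the file first and then reads it. Here is an excerpt: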

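import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// path, dPath, fs and rlength are fields of the enclosing record reader.
private void codecWiseDecompress(Configuration conf) throws IOException {
    // Pick the codec from the file extension (for example ".gz").
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);
    if (codec == null) {
        // Throw instead of calling System.exit so the framework can report the failure.
        throw new IOException("No codec found for " + path);
    }

    // Strip the codec suffix to name the decompressed copy.
    String outputUri = CompressionCodecFactory.removeSuffix(
            path.toString(), codec.getDefaultExtension());
    dPath = new Path(outputUri);
    fs = path.getFileSystem(conf);

    InputStream in = null;
    OutputStream out = null;
    try {
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
    // Record the decompressed length only after a successful copy.
    rlength = fs.getFileStatus(dPath).getLen();
}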
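
A different approach from the excerpt, which writes a decompressed copy back to the file system, is to read lines straight from the decompression stream. The sketch below illustrates that idea with JDK classes only (GZIPInputStream stands in for Hadoop's CompressionCodec, and a byte loop stands in for LineReader), so it runs without a cluster. The class name GzipLineSketch, the "fileName:offset<TAB>line" record layout, and the assumption of "\n" line endings in UTF-8 text are all illustrative choices, not part of the original answer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLineSketch {

    /** Returns "fileName:offset<TAB>line" records; offsets count decompressed bytes. */
    static List<String> readRecords(Path gzFile) throws IOException {
        List<String> records = new ArrayList<>();
        String name = gzFile.getFileName().toString();
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(Files.newInputStream(gzFile)))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long pos = 0;        // decompressed bytes consumed so far
            long lineStart = 0;  // offset of the line currently being collected
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    records.add(name + ":" + lineStart + "\t" + line.toString("UTF-8"));
                    line.reset();
                    lineStart = pos;
                } else {
                    line.write(b);
                }
            }
            if (line.size() > 0) {  // last line without a trailing newline
                records.add(name + ":" + lineStart + "\t" + line.toString("UTF-8"));
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build a small gzip file so the sketch is runnable on its own.
        Path gz = Files.createTempFile("sample", ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write("first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        }
        readRecords(gz).forEach(System.out::println);
        Files.delete(gz);
    }
}
```

Streaming avoids the extra write of the decompressed file, at the cost of not knowing the decompressed length up front.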

Custom Writable class:

A pair holding the file name and the offset value.
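
A minimal sketch of such a key class follows. So that it compiles without Hadoop on the classpath, it implements only Comparable; in a real job you would declare it as implementing org.apache.hadoop.io.WritableComparable, whose write and readFields methods have the same signatures as the ones shown. The class name FileLineKey and its field names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// In a real job: "implements org.apache.hadoop.io.WritableComparable<FileLineKey>".
public class FileLineKey implements Comparable<FileLineKey> {
    private String fileName = "";
    private long offset;

    public FileLineKey() { }  // Hadoop creates keys reflectively, so keep a no-arg constructor

    public FileLineKey(String fileName, long offset) {
        this.fileName = fileName;
        this.offset = offset;
    }

    public String getFileName() { return fileName; }
    public long getOffset() { return offset; }

    // Serialize the pair; same signature as Writable.write.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeLong(offset);
    }

    // Deserialize in the same field order; same signature as Writable.readFields.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        offset = in.readLong();
    }

    @Override
    public int compareTo(FileLineKey other) {
        int byName = fileName.compareTo(other.fileName);
        return byName != 0 ? byName : Long.compare(offset, other.offset);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileLineKey)) return false;
        FileLineKey k = (FileLineKey) o;
        return offset == k.offset && fileName.equals(k.fileName);
    }

    @Override
    public int hashCode() {
        return 31 * fileName.hashCode() + Long.hashCode(offset);
    }

    @Override
    public String toString() { return fileName + ":" + offset; }

    public static void main(String[] args) throws IOException {
        // Round-trip a key through its own serialization.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new FileLineKey("part-0001.gz", 128L).write(new DataOutputStream(buf));
        FileLineKey copy = new FileLineKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy);
    }
}
```

Sorting by file name first and offset second keeps all lines of one file together and in order when the key is used in the shuffle.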
