How do I use CombineFileInputFormat in Hadoop?

Asked: 2012-04-30 07:48:32

Tags: java hadoop mapreduce

I want to use the CombineFileInputFormat of Hadoop 0.20.0 / 0.20.2 such that it processes one file per record, without compromising on data locality (which it normally takes care of).

It is mentioned in Tom White's Hadoop: The Definitive Guide, but he does not show how to do it. Instead, he moves on to sequence files.

I am quite confused about what the processed variable in the record reader is used for. Any code example would be of great help.

Thanks in advance.

2 Answers:

Answer 0 (score: 1)

Check out the following input format, which uses CombineFileInputFormat to combine files.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;


/**
 * Custom input format that implements createRecordReader of the
 * abstract class CombineFileInputFormat.
 */

public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    public static class MyRecordReader extends RecordReader<LongWritable, Text> {
        // Delegate line reader that does the actual reading of one file.
        private LineRecordReader delegate = null;
        // Index of the file within the CombineFileSplit handled by this reader.
        private int idx;

        // CombineFileRecordReader instantiates this class reflectively and
        // requires exactly this constructor signature: (split, context, index).
        public MyRecordReader(CombineFileSplit split, TaskAttemptContext taskcontext, Integer idx) throws IOException {
            this.idx = idx;
            delegate = new LineRecordReader();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }

        @Override
        public float getProgress() {
            try {
                return delegate.getProgress();
            }
            catch(Exception e) {
                return 0;
            }
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext taskcontext) throws IOException {
            // Carve the idx-th file out of the combined split and hand it to
            // the delegate as an ordinary FileSplit.
            CombineFileSplit csplit = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(csplit.getPath(idx), csplit.getOffset(idx), csplit.getLength(idx), csplit.getLocations());
            delegate.initialize(fileSplit, taskcontext);
        }

        @Override
        public LongWritable getCurrentKey() throws IOException,
                InterruptedException {
            return delegate.getCurrentKey();
        }


        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext taskcontext) throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) split, taskcontext, MyRecordReader.class);
    }
}
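
To wire this format into a job, here is a minimal driver sketch (not part of the original answer): the class name MyCombineDriver, the job name, and the 128 MB cap are placeholder assumptions; setMaxInputSplitSize keeps Hadoop from packing the entire input into a single split.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyCombineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "combine-file-example"); // placeholder job name
        job.setJarByClass(MyCombineDriver.class);

        // Plug in the combining input format defined above.
        job.setInputFormatClass(MyCombineFileInputFormat.class);

        // Cap the combined split size (128 MB here, an arbitrary example),
        // otherwise all input may end up in one split and one map task.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Identity job: mapper and reducer left at their defaults for brevity.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}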

Answer 1 (score: 0)

This is the simplest way to use CombineFileInputFormat from the so-called "new API". Suppose your actual input format is MyFormat, and it works with keys of MyKey and values of MyValue (it might be some subclass of SequenceFileInputFormat< MyKey, MyValue >, for example).

public class CombinedMyFormat extends CombineFileInputFormat< MyKey, MyValue > {
    // exists merely to fix the key/value types and
    // inject the delegate format to the superclass
    // if MyFormat does not use state, consider a constant instead
    private static class CombineMyKeyMyValueReaderWrapper
        extends CombineFileRecordReaderWrapper< MyKey, MyValue > {
        protected CombineMyKeyMyValueReaderWrapper(
            CombineFileSplit split, TaskAttemptContext ctx, Integer idx
        ) throws IOException, InterruptedException {
            super( new MyFormat(), split, ctx, idx );
        }
    }

    @Override
    public RecordReader< MyKey, MyValue > createRecordReader(
        InputSplit split, TaskAttemptContext ctx
    ) throws IOException {
        return new CombineFileRecordReader< MyKey, MyValue >(
            ( CombineFileSplit )split, ctx, CombineMyKeyMyValueReaderWrapper.class
        );
    }
}

In your job driver, you should now be able to just drop in CombinedMyFormat for MyFormat. You should also set the max split size property to prevent Hadoop from combining the entire input into a single split.
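
As a concrete illustration of that last step, here is a short sketch of the driver-side change; CombinedMyFormat is the hypothetical class from above, and the 128 MB cap is an arbitrary example value:

// Swap the original input format for the combining wrapper.
job.setInputFormatClass(CombinedMyFormat.class);

// Cap the combined split size (128 MB here) so the whole input is not
// merged into a single split and processed by a single map task.
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);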