WholeFileInputFormat and CombineFileInputFormat

Posted: 2015-04-20 23:50:31

Tags: hadoop mapreduce hdfs bigdata yarn

How can WholeFileInputFormat-style processing be used together with CombineFileInputFormat? Suppose I have 10,000 small binary files and am currently using WholeFileInputFormat. It works correctly but is inefficient, because it creates 10,000 mappers while each map task finishes in only a few seconds. That is why I want to pass more files to a single mapper and reduce the per-task overhead. One option is CombineFileInputFormat. I got it running and it created the expected number of mappers, but the job runs indefinitely. I believe my implementation of getProgress is wrong, and I have also noticed that each map task keeps re-reading the first file of its split instead of moving on to the next file in the list.
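
A job using this input format would be configured roughly as follows. This is only an illustrative sketch with the old mapred API; MyMapper and the input/output paths are placeholders, not part of the original post.

// Illustrative driver sketch (old mapred API); MyMapper and the
// paths below are placeholders, not from the original post.
JobConf conf = new JobConf(MyCombineFileInputFormat.class);
conf.setJobName("combine-small-binary-files");
conf.setInputFormat(MyCombineFileInputFormat.class);
conf.setOutputKeyClass(NullWritable.class);
conf.setOutputValueClass(BytesWritable.class);
conf.setMapperClass(MyMapper.class);
conf.setNumReduceTasks(0);
FileInputFormat.addInputPath(conf, new Path("/input/small-files"));
FileOutputFormat.setOutputPath(conf, new Path("/output"));
JobClient.runJob(conf);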

Here is my custom input format:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class MyCombineFileInputFormat extends CombineFileInputFormat<NullWritable, BytesWritable> {

    public MyCombineFileInputFormat() {
        super();
        setMaxSplitSize(1048576); // pack small files into splits of at most 1 MB
    }

    // Each small file is read whole, so individual files must not be split.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        return new CombineFileRecordReader<NullWritable, BytesWritable>(job, (CombineFileSplit) split, reporter,
            (Class) MyCombineFileRecordReader.class);
    }
}
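
Note that CombineFileRecordReader is only a wrapper: for each path in the CombineFileSplit it instantiates a fresh MyCombineFileRecordReader, passing that path's index to the (CombineFileSplit, Configuration, Reporter, Integer) constructor shown below. Each reader instance therefore only ever sees a single file.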

And here is my custom combine-file record reader:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
import org.apache.log4j.Logger;

public class MyCombineFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

    private NullWritable key = NullWritable.get();
    private BytesWritable value = new BytesWritable();
    private Path path;
    private FileSystem fs;
    private FileSplit filesplit;
    private int totalNumOfPaths;
    // NOTE: static, so this counter is shared by every reader instance in the task
    private static int processedPaths;

    public static Logger LOGGER = Logger.getLogger(MyCombineFileRecordReader.class);

    public MyCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index)
            throws IOException {
        this.totalNumOfPaths = split.getNumPaths();
        LOGGER.info("**** Total number of paths: " + totalNumOfPaths);
        this.filesplit =
                new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index),
                    split.getLocations());
        this.path = split.getPath(index);
        this.fs = this.path.getFileSystem(conf);
        processedPaths = 0;
    }

    // I think this is not right
    @Override
    public float getProgress() throws IOException {
        if (processedPaths == totalNumOfPaths) {
            LOGGER.info("**** Completed # of files");
            return 1.0f;
        } else {
            return 0.0f;
        }
    }

    // I have found that this method is being called multiple times for the same file
    @Override
    public boolean next(NullWritable key, BytesWritable val) throws IOException {
        if (filesplit != null) {
            byte[] contents = new byte[(int) filesplit.getLength()];
            LOGGER.info("**** Reading path: " + path);
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
                LOGGER.info("**** Processed path count: " + processedPaths);
                processedPaths++;
            } finally {
                IOUtils.closeStream(in);
            }
            return true;
        }
        return false;
    }

    @Override
    public NullWritable createKey() {
        return key;
    }

    @Override
    public BytesWritable createValue() {
        return value;
    }

    @Override
    public long getPos() throws IOException {
        return 0; // not shown in the original post; minimal stub to satisfy RecordReader
    }

    @Override
    public void close() throws IOException {
        // not shown in the original post; streams are already closed in next()
    }
}
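
For reference, a minimal sketch of one way next() and getProgress() might be restructured, assuming the per-file contract described above: each reader instance owns exactly one file, so it should emit that file once and then return false, letting the wrapping CombineFileRecordReader advance to the reader for the next path. An instance flag replaces the shared static counter.

// Sketch of a possible fix (not a verified implementation): emit the
// single file owned by this reader exactly once, then signal end-of-input.
private boolean processed = false;

@Override
public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (processed) {
        return false; // done; the wrapper moves on to the next file's reader
    }
    byte[] contents = new byte[(int) filesplit.getLength()];
    FSDataInputStream in = null;
    try {
        in = fs.open(path);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
    } finally {
        IOUtils.closeStream(in);
    }
    processed = true;
    return true;
}

@Override
public float getProgress() throws IOException {
    // Report progress for this file only; CombineFileRecordReader
    // aggregates progress across all files in the split.
    return processed ? 1.0f : 0.0f;
}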

0 Answers:

No answers yet.