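How can WholeFileInputFormat be used together with CombineFileInputFormat? Suppose I have 10,000 small binary files and currently use WholeFileInputFormat. It works, but it is inefficient because it creates 10,000 mappers, and each map task takes only a few seconds. That is why I want to pass more files to a single mapper to reduce the overhead. One option is CombineFileInputFormat. I was able to run it, and it created the expected number of mappers, but the job ran indefinitely. I suspect my getProgress implementation is wrong, and I noticed that each map task reads only the first file of its split instead of moving on to the next file in the list.
Here is my custom input format: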
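import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class MyCombineFileInputFormat extends CombineFileInputFormat<NullWritable, BytesWritable> {
    public MyCombineFileInputFormat() {
        super();
        setMaxSplitSize(1048576); // cap each combined split at 1 MB
    }
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // every file must be read whole
    }
    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        return new CombineFileRecordReader<NullWritable, BytesWritable>(job, (CombineFileSplit) split, reporter,
                (Class) MyCombineFileRecordReader.class);
    }
}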
Here is my custom combine file record reader:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
import org.apache.log4j.Logger;

public class MyCombineFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private NullWritable key = NullWritable.get();
    private BytesWritable value = new BytesWritable();
    private Path path;
    private FileSystem fs;
    private FileSplit filesplit;
    private int totalNumOfPaths;
    private static int processedPaths;
    public static Logger LOGGER = Logger.getLogger(MyCombineFileRecordReader.class);
    public MyCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index)
            throws IOException {
        this.totalNumOfPaths = split.getNumPaths();
        LOGGER.info("**** Total number of paths: " + totalNumOfPaths);
        this.filesplit =
                new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index),
                        split.getLocations());
        this.path = split.getPath(index);
        this.fs = this.path.getFileSystem(conf);
        processedPaths = 0;
    }
    // I think this is not right
    @Override
    public float getProgress() throws IOException {
        if (processedPaths == totalNumOfPaths) {
            LOGGER.info("**** Completed # of files");
            return 1.0f;
        } else {
            return 0.0f;
        }
    }
    // I have found that this method is being called multiple times for the same file
    @Override
    public boolean next(NullWritable key, BytesWritable val) throws IOException {
        if (filesplit != null) {
            byte[] contents = new byte[(int) filesplit.getLength()];
            LOGGER.info("**** Reading path: " + path);
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
                LOGGER.info("**** Processed path count: " + processedPaths);
                processedPaths++;
            } finally {
                IOUtils.closeStream(in);
            }
            return true;
        }
        return false;
    }
    @Override
    public NullWritable createKey() {
        return key;
    }
    @Override
    public BytesWritable createValue() {
        return value;
    }
}
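
A possible explanation for the repeated reads, offered as a sketch and not a confirmed diagnosis: as far as I understand the old-API CombineFileRecordReader, it keeps calling next() on the current per-file reader until that reader returns false, and only then builds the reader for the next path in the split. In the reader above, filesplit is never cleared, so next() would return true on every call and the wrapper would never advance past the first file. The plain-Java simulation below has no Hadoop dependency; SimpleReader, OneShotReader and CombiningDriver are hypothetical stand-ins invented for illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the old-API RecordReader contract (not a Hadoop type).
interface SimpleReader {
    boolean next(StringBuilder value); // fill value; return false once exhausted
    float getProgress();
}

// Emits one whole "file" as a single record, then reports exhaustion.
class OneShotReader implements SimpleReader {
    private final String contents;
    private boolean processed = false; // per-instance flag, deliberately not static

    OneShotReader(String contents) {
        this.contents = contents;
    }

    @Override
    public boolean next(StringBuilder value) {
        if (processed) {
            return false; // second call: signal the driver to move to the next file
        }
        value.setLength(0);
        value.append(contents); // write into the caller's object, not a private field
        processed = true;
        return true;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f; // progress of this one file only
    }
}

// Mimics the delegation loop: advance whenever the current reader returns false.
class CombiningDriver {
    static List<String> readAll(List<String> files) {
        List<String> records = new ArrayList<>();
        StringBuilder value = new StringBuilder(); // one value object shared by all readers
        SimpleReader current = null;
        int index = 0;
        while (true) {
            while (current == null || !current.next(value)) {
                if (index >= files.size()) {
                    return records; // no more files in the split
                }
                current = new OneShotReader(files.get(index++));
            }
            records.add(value.toString());
        }
    }
}
```

If this reading is right, the analogous change in MyCombineFileRecordReader would be a non-static boolean flag that makes next() return false on its second call, a getProgress() that reports only the current file, and a next() that writes into the val parameter it receives instead of the reader's own value field, since the framework may hand every per-file reader the value object created by the first one.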