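I want to use the CombineFileInputFormat of Hadoop 0.20.0 / 0.20.2 so that it processes one file per record and does not compromise on data locality (which it normally takes care of).
It is mentioned in Tom White's Hadoop: The Definitive Guide, but he does not show how to do it. Instead, he moves on to sequence files.
I am confused about the meaning of the processed variable in the record reader. Any code example would be a great help.
Thanks in advance.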
Answer 0 (score: 1)
Check the following input format, which builds on CombineFileInputFormat.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/**
 * Custom input format that implements the abstract createRecordReader
 * method of CombineFileInputFormat.
 */
public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    /** Reads the idx-th file of a CombineFileSplit line by line. */
    public static class MyRecordReader extends RecordReader<LongWritable, Text> {
        private LineRecordReader delegate = null;
        private int idx; // index of this reader's file within the combined split

        // CombineFileRecordReader instantiates this class reflectively and
        // expects exactly this constructor signature.
        public MyRecordReader(CombineFileSplit split, TaskAttemptContext taskcontext, Integer idx) throws IOException {
            this.idx = idx;
            delegate = new LineRecordReader();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }

        @Override
        public float getProgress() {
            try {
                return delegate.getProgress();
            } catch (Exception e) {
                return 0;
            }
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext taskcontext) throws IOException {
            // Carve the idx-th file out of the combined split and hand it to the delegate.
            CombineFileSplit csplit = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(csplit.getPath(idx), csplit.getOffset(idx), csplit.getLength(idx), csplit.getLocations());
            delegate.initialize(fileSplit, taskcontext);
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext taskcontext) throws IOException {
        // One MyRecordReader is created per file contained in the combined split.
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) split, taskcontext, MyRecordReader.class);
    }
}
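Note that the reader above delegates to LineRecordReader, so each record is one line, not one file. To get one record per file, the delegate would have to emit the whole file exactly once. The processed variable asked about in the question is presumably the boolean flag that enforces this "exactly once" behaviour. The following is only a Hadoop-free sketch of that idea, not the book's code: the class name WholeFileReaderSketch and its methods are hypothetical and merely mimic the nextKeyValue() / getCurrentValue() / getProgress() contract of a RecordReader using plain java.nio.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hadoop-free sketch (hypothetical names): mimics a RecordReader that
// emits exactly one record per file, guarded by a "processed" flag.
final class WholeFileReaderSketch {
    private final Path file;
    private boolean processed = false; // true once the single record has been emitted
    private String value;              // contents of the file, set by nextKeyValue()

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Returns true exactly once (after loading the whole file), then false.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        value = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        processed = true;
        return true;
    }

    String getCurrentValue() {
        return value;
    }

    // Progress is all-or-nothing because there is only one record.
    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole-file", ".txt");
        Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
        while (reader.nextKeyValue()) { // loop body runs once per file
            System.out.println("record length: " + reader.getCurrentValue().length());
        }
        Files.delete(tmp);
    }
}
```

In a real job the same flag would live in a RecordReader subclass whose initialize() receives the per-file FileSplit, as MyRecordReader does above; the file would then be read through the Hadoop FileSystem API instead of java.nio.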
Answer 1 (score: 0)
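Here is the simplest way to use CombineFileInputFormat from the so-called "new API". Suppose that your actual input format is MyFormat and that it works with keys of type MyKey and values of type MyValue (it might be some subclass of SequenceFileInputFormat< MyKey, MyValue >, for example).
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// MyFormat, MyKey and MyValue are placeholders for your own types.
public class CombinedMyFormat extends CombineFileInputFormat< MyKey, MyValue > {

    // exists merely to fix the key/value types and
    // inject the delegate format to the superclass
    // if MyFormat does not use state, consider a constant instead
    private static class CombineMyKeyMyValueReaderWrapper
            extends CombineFileRecordReaderWrapper< MyKey, MyValue > {
        protected CombineMyKeyMyValueReaderWrapper(
            CombineFileSplit split, TaskAttemptContext ctx, Integer idx
        ) throws IOException, InterruptedException {
            super( new MyFormat(), split, ctx, idx );
        }
    }

    @Override
    public RecordReader< MyKey, MyValue > createRecordReader(
        InputSplit split, TaskAttemptContext ctx
    ) throws IOException {
        return new CombineFileRecordReader< MyKey, MyValue >(
            ( CombineFileSplit )split, ctx, CombineMyKeyMyValueReaderWrapper.class
        );
    }
}
In your job driver you should now be able to simply drop in CombinedMyFormat in place of MyFormat. You should also set the max split size property to prevent Hadoop from combining the entire input into a single split.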
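To illustrate why the max split size matters, here is a deliberately simplified, Hadoop-free model of size-bounded packing. It is a sketch under stated assumptions, not Hadoop's actual algorithm: it ignores node and rack locality and block boundaries, and the class SplitPackingSketch and its pack method are hypothetical names. In a real driver the limit can be set with FileInputFormat.setMaxInputSplitSize(job, bytes); as far as I know the underlying property is mapreduce.input.fileinputformat.split.maxsize in Hadoop 2 and later, and mapred.max.split.size in older releases.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, Hadoop-free model of size-bounded split packing (hypothetical names).
// Assumption: ignores node/rack locality and block boundaries.
final class SplitPackingSketch {

    // Groups file lengths into splits; a split is closed once its total
    // size reaches maxSplitSize. maxSplitSize == 0 means "no limit".
    static List<List<Long>> pack(long[] fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : fileLengths) {
            current.add(length);
            currentSize += length;
            if (maxSplitSize != 0 && currentSize >= maxSplitSize) {
                splits.add(current);          // limit reached: close this split
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);              // leftover files form the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] files = {40, 40, 40, 40, 40};
        System.out.println("no limit:  " + pack(files, 0).size() + " split(s)");
        System.out.println("limit 100: " + pack(files, 100).size() + " split(s)");
    }
}
```

With no limit, every file in this model lands in one split and therefore in one map task, which is the situation the answer warns about.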