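My data has 20 fields in its schema. Only the first three fields matter to my MapReduce program. How can I reduce the mapper's input so that it receives only the first three fields?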
The schema has columns 1,2,3,4,5,6,7,8...20.
I want the mapper to receive only columns 1,2,3, as offset (key) and value.
Note that I cannot use Pig, because some other logic is already implemented directly in MapReduce.
Answer 0 (score: 0)
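You can implement a custom input format in MapReduce so that only the required fields are read.
For reference, the following blog post explains how to read text as paragraphs: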
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading
Answer 1 (score: 0)
You need a custom RecordReader to do this:
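import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Wraps LineRecordReader and hands the mapper only the first three comma-separated fields.
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader;
    private final LongWritable lineKey;
    private final Text lineValue;

    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    // Reads the next full line, then fills key and value with the offset and the trimmed record.
    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        // Copy the line's byte offset into the caller's key.
        key.set(lineKey.get());
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }

    @Override
    public LongWritable createKey() {
        return lineReader.createKey();
    }

    @Override
    public Text createValue() {
        return lineReader.createValue();
    }

    @Override
    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}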
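It should be fairly self-explanatory: it is just a thin wrapper around LineRecordReader. Note that each full line is still read from the file; only the record handed to the mapper is trimmed.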
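The `split(",")` call in `next` tokenizes all 20 fields even though only three are kept. As a minimal sketch of an alternative, assuming fields never contain quoted or escaped commas, the helper below scans only as far as the third comma. The names `FieldTrimmer` and `firstFields` are hypothetical and not part of Hadoop or of the answer above.

```java
// Hypothetical helper (not a Hadoop class): keeps the first n comma-separated fields.
final class FieldTrimmer {
    private FieldTrimmer() {
    }

    /**
     * Returns the first n comma-separated fields of line, still joined by commas.
     * Scans only up to the n-th comma instead of splitting the whole line.
     * Assumes plain CSV without quoted or escaped commas.
     */
    static String firstFields(String line, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int end = -1;
        for (int i = 0; i < n; i++) {
            end = line.indexOf(',', end + 1);
            if (end < 0) {
                // No further comma: the line has exactly n fields only if
                // this was the last comma we were looking for.
                if (i == n - 1) {
                    return line;
                }
                throw new IllegalArgumentException(
                        "Expected at least " + n + " fields: " + line);
            }
        }
        // end is the index of the n-th comma; drop it and everything after it.
        return line.substring(0, end);
    }
}
```

Inside `next`, something like `value.set(FieldTrimmer.firstFields(lineValue.toString(), 3))` could then stand in for the split-and-concatenate step, with the `IllegalArgumentException` rethrown as an `IOException` if the original error behavior is wanted.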
Unfortunately, to invoke it you also need to extend InputFormat. The following is sufficient:
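import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Input format (old org.apache.hadoop.mapred API) that supplies TrimmedRecordReader for each split.
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}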
Don't forget to set it in your driver.
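As a rough sketch of what that could look like with the old `org.apache.hadoop.mapred` API: the driver class name `TrimmedJobDriver`, the job name, and the use of `args[0]` and `args[1]` as input and output paths are all assumptions for illustration. It needs Hadoop and the two classes above on the classpath, and a real job would also set its own mapper and reducer.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver; the class name, job name and argument layout are illustrative only.
public class TrimmedJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TrimmedJobDriver.class);
        conf.setJobName("trimmed-input");

        // Use the custom input format so mappers see only the first three fields.
        conf.setInputFormat(TrimmedTextInputFormat.class);

        // No mapper is set here, so the default identity mapper passes records through.
        // Zero reducers makes this a map-only job.
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

The only line specific to this answer is `conf.setInputFormat(TrimmedTextInputFormat.class)`; the rest is ordinary job setup.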