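How to efficiently reduce the mapper's input length

Time: 2014-11-27 09:57:11

Tags: hadoop mapreduce hdfs

My data has 20 fields in its schema. Only the first three fields matter to my MapReduce program. How can I reduce the mapper's input size so that it receives only the first three fields?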

1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper to process it as offset and value.

Note that I cannot use Pig, because some other logic is already implemented directly in MapReduce.

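2 answers:

Answer 0: (score: 0)

You can implement a custom input format in MapReduce so that only the required fields are passed to the mapper.

For reference, the following blog post explains how to read text as paragraphs with a custom input format: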

http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading
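
Whatever input format is used, the core step is cutting each line down to the leading fields. The following is a minimal sketch of that step in plain Java, independent of Hadoop. The class name FieldTrimmer and the method firstFields are hypothetical names chosen for illustration, and the sketch assumes a simple delimiter with no quoting or escaping inside fields.

```java
public class FieldTrimmer {

    // Returns the first `count` delimiter-separated fields of `line`.
    // If the line has `count` fields or fewer, it is returned unchanged.
    // Assumption: fields never contain the delimiter (no quoting or escaping).
    public static String firstFields(String line, int count, char delimiter) {
        if (count <= 0) {
            return "";
        }
        int end = -1;
        for (int i = 0; i < count; i++) {
            // Find the delimiter that closes field number i.
            end = line.indexOf(delimiter, end + 1);
            if (end < 0) {
                return line; // ran out of delimiters: nothing to cut off
            }
        }
        return line.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstFields("1,2,3,4,5,6,7,8", 3, ','));
    }
}
```

Scanning with indexOf stops at the third delimiter, so the remaining 17 fields are never split into separate strings.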

Answer 1: (score: 0)

You need a custom RecordReader to do this:

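import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Wraps LineRecordReader and passes only the first three comma-separated fields to the mapper.
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {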
   private LineRecordReader lineReader;
   private LongWritable lineKey;
   private Text lineValue;

   public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
      lineReader = new LineRecordReader(job, split);
      lineKey = lineReader.createKey();
      lineValue = lineReader.createValue();
   }

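   public boolean next(LongWritable key, Text value) throws IOException {
      // Read the full line into the internal buffers first.
      if (!lineReader.next(lineKey, lineValue)) {
          return false;
      }

      // A limit of 4 stops splitting after the third field; the rest stays in fields[3].
      String[] fields = lineValue.toString().split(",", 4);
      if (fields.length < 3) {
          throw new IOException("Invalid record received");
      }
      // Copy the byte offset into the caller's key; otherwise the mapper always sees 0.
      key.set(lineKey.get());
      value.set(fields[0] + "," + fields[1] + "," + fields[2]);
      return true;
   }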

   public LongWritable createKey() {
      return lineReader.createKey();
   }

   public Text createValue() {
      return lineReader.createValue();
   }

   public long getPos() throws IOException {
      return lineReader.getPos();
   }

   public void close() throws IOException {
      lineReader.close();
   }

   public float getProgress() throws IOException {
      return lineReader.getProgress();
   }
} 

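It should be fairly self-explanatory: it is just a wrapper around LineRecordReader. Unfortunately, to have it invoked you also need to extend InputFormat. The following is enough: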

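import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Input format that hands each file split to TrimmedRecordReader.
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {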

   public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
     JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
   }
}

Don't forget to set it as the input format in your driver.
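
A driver that sets the input format might look like the sketch below. It assumes the old org.apache.hadoop.mapred API used by the classes above and needs the Hadoop libraries on the classpath. The class name TrimmedJobDriver, the job name, and the use of IdentityMapper with zero reducers are placeholders for illustration; substitute your own mapper, reducer, and output types.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Hypothetical driver: args[0] is the input path, args[1] the output path.
public class TrimmedJobDriver {
   public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(TrimmedJobDriver.class);
      conf.setJobName("trimmed-input");

      // The important line: mappers now receive only the first three fields.
      conf.setInputFormat(TrimmedTextInputFormat.class);

      // Placeholder map-only job that writes the trimmed records back out.
      conf.setMapperClass(IdentityMapper.class);
      conf.setNumReduceTasks(0);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}
```

Note that the record reader still reads every full line from HDFS. The trimming reduces what is handed to the map function, not the bytes read from disk.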