Determine which file path a block belongs to in Hadoop

Posted: 2016-04-28 09:11:57

Tags: hadoop

My job has multiple input paths. For example:

    //Driver.class
        for (String s : listFile) {
            MultipleInputs.addInputPath(job, new Path(s), SequenceFileInputFormat.class);// ex: /home/path1, /home/path2, ...
        }
        .....
    //Mapper.class
        public void map(Text key, Data bytes, Context context) throws IOException, InterruptedException {
                .....
            }

My question is: is there a way to determine, inside the map() function, which file the current (key, value) pair comes from?

1 Answer:

Answer 0 (score: 0)

Since you use SequenceFileInputFormat as the InputFormat, and SequenceFileInputFormat uses SequenceFileRecordReader as its RecordReader and extends FileInputFormat, whose getSplits() method returns FileSplits that hold a Path, the SequenceFileRecordReader can of course obtain that Path. So what you need to do is arrange for one of the key and value you receive in map() to carry the Path.

Here are the steps:

  • Make a custom value class that carries both the original value and the path (a fuller sketch of the Writable methods it needs appears after these steps):

        class YourValClass implements Writable {
            Writable value; // your original value
            Path path;      // the path you want
        }

  • Make a custom InputFormat class extending SequenceFileInputFormat, and override the createRecordReader() method to return your custom RecordReader:

        class YourInputFormat extends SequenceFileInputFormat<YourKeyClass, YourValClass> {
            @Override
            public RecordReader<YourKeyClass, YourValClass> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
                return new YourRecordReader(); // return your custom RecordReader
            }
        }

  • Make a custom RecordReader in which you combine the value and the path:

        class YourRecordReader extends SequenceFileRecordReader<YourKeyClass, YourValClass> {
            Path path;

            @Override
            public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
                super.initialize(inputSplit, taskAttemptContext);
                FileSplit fileSplit = (FileSplit) inputSplit;
                this.path = fileSplit.getPath(); // assign the path
            }

            @Override
            public YourValClass getCurrentValue() {
                YourValClass val = super.getCurrentValue();
                if (null != val) {
                    val.path = path; // set the path
                }
                return val;
            }
        }

Now you can get the path from the value in your map() function.
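A note on the first step: the YourValClass skeleton above leaves out the write() and readFields() methods that the Writable interface requires. A minimal sketch of what they could look like, assuming purely for illustration that the wrapped value is a BytesWritable (substitute whatever value type your sequence files actually hold):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class YourValClass implements Writable {
        BytesWritable value = new BytesWritable(); // assumed stand-in for your original value type
        Path path;                                 // the file this record came from

        @Override
        public void write(DataOutput out) throws IOException {
            Text.writeString(out, path == null ? "" : path.toString()); // serialize the path as a string
            value.write(out);                                           // then the wrapped value
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            String p = Text.readString(in);
            path = p.isEmpty() ? null : new Path(p);
            value.readFields(in);
        }
    }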
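And to tie it back to the code in the question, a rough sketch of how the driver and mapper might be wired up; the names YourKeyClass and PathAwareMapper are placeholders, and the only driver change is registering the custom input format instead of SequenceFileInputFormat:

    // Driver: register the custom input format for every input path
    for (String s : listFile) {
        MultipleInputs.addInputPath(job, new Path(s), YourInputFormat.class);
    }

    // Mapper: the source path now travels inside the value
    public static class PathAwareMapper extends Mapper<YourKeyClass, YourValClass, Text, Text> {
        @Override
        protected void map(YourKeyClass key, YourValClass value, Context context)
                throws IOException, InterruptedException {
            Path sourceFile = value.path; // the file this (key, value) pair was read from
            // ... process the record, branching on sourceFile if needed
        }
    }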