Question

I'm working with the WordCount example, and I need to get the file name inside the Reduce function.

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    String filename = ((FileSplit)(.getContext()).getInputSplit()).getPath().getName();
    // ----------------------------^ I need to get the context and filename!
    key.set(key.toString() + " (" + filename + ")");
    output.collect(key, new IntWritable(sum));
  }
}

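This is the modified code from above; I want to get the file name so I can print it alongside each word. I tried following "Java Hadoop: How can I create mappers that take as input files and give an output which is the number of lines in each file?", but I couldn't get the context object.

I'm new to Hadoop and need help with this. Any suggestions?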

Answer 1

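You can't get the context because context is a "new API" construct, and you are using the "old API".

Take a look at this word count example: http://wiki.apache.org/hadoop/WordCount

In that example, look at the signature of the reduce function: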

public void reduce(Text key, Iterable<IntWritable> values, Context context)

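See? Context! Note that this example imports from .mapreduce. rather than .mapred.

This is a common problem for new Hadoop users, so don't feel bad. In general you'll want to stick with the new API, for a number of reasons. Be very careful about the examples you find, though. Also be aware that the new API and the old API are not interoperable (for example, you can't have a new-API mapper and an old-API reducer).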

Answer 2

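With the old MR API (the org.apache.hadoop.mapred package), add the following to your mapper class. The map.input.file property is set per map task, so a reducer has no current input file to read from it.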

String fileName = "";

// Called once per task, before any records are processed.
public void configure(JobConf job)
{
    fileName = job.get("map.input.file");
}

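With the new MR API (the org.apache.hadoop.mapreduce package), add the following to your mapper class. getInputSplit() is defined on the mapper's context, not the reducer's.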

String fileName = "";

@Override
protected void setup(Context context) throws java.io.IOException, java.lang.InterruptedException
{
    // The cast assumes a file-based input format, where the split is a FileSplit.
    fileName = ((FileSplit) context.getInputSplit()).getPath().toString();
}
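
One thing to watch: getPath().toString() gives the full path, while getPath().getName() (used elsewhere in this thread) gives only the last component. Here is a minimal plain-Java sketch of that difference, with no Hadoop dependency. The class PathNameDemo, the helper lastComponent, and the sample HDFS URI are all hypothetical, made up purely for illustration.

```java
import java.net.URI;

class PathNameDemo {
    // Hypothetical helper: returns the last path component of a URI string,
    // i.e. the part a file-name lookup such as Path.getName() would report.
    static String lastComponent(String fullPath) {
        String path = URI.create(fullPath).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up path, standing in for what getPath().toString() might return.
        String full = "hdfs://namenode:8020/user/demo/input/file1.txt";
        System.out.println(full);                 // the full path
        System.out.println(lastComponent(full));  // only the file name
    }
}
```

If you only want the bare file name in your output keys, use getName() rather than toString().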

Answer 3

I got it working this way!

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
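    // Look up the file behind the current split once per call (old API: via the Reporter).
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // Tag each word with its file name so the reducer receives it as part of the key.
      word.set(tokenizer.nextToken() + " (" + filename + ")");
      output.collect(word, one);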
    }
  }
}

Let me know if I can improve it!
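
To make the data flow concrete: a reducer never sees an input split, so the file name has to travel inside the key (or value) that the mapper emits. Here is a rough plain-Java simulation of that idea, with no Hadoop classes involved. FileTaggedWordCount, its map and reduce methods, and the sample file contents are all invented for illustration and are not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class FileTaggedWordCount {
    // Stand-in for the map phase: emit one "word (filename)" key per token.
    static List<String> map(String filename, String line) {
        List<String> keys = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            keys.add(tokenizer.nextToken() + " (" + filename + ")");
        }
        return keys;
    }

    // Stand-in for shuffle + reduce: group identical keys and sum their counts.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input: two tiny "files" with one line each.
        List<String> emitted = new ArrayList<>();
        emitted.addAll(map("a.txt", "hello world hello"));
        emitted.addAll(map("b.txt", "hello hadoop"));
        reduce(emitted).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because the file name is part of the key, each word is counted per file, so the same word appearing in two files ends up under two separate keys.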

How to get the current file name in Hadoop Reduce