Question

In this article, I found this mapper code for word count:

  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

In contrast, the official tutorial provides this mapper:

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

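Until now I have only seen Context used to write something from the mapper to the reducer; I have never seen (or used) OutputCollector. I have read the documentation, but I don't understand the point of it or why I should use it.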

Answer 1

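The two snippets use different MapReduce APIs. OutputCollector belongs to MRV1 (the old API in the org.apache.hadoop.mapred package), and Context belongs to MRV2 (the new API in the org.apache.hadoop.mapreduce package).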

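Java MapReduce API 1, also known as MRV1, was released with the initial Hadoop versions. The flaw in those initial versions was that the MapReduce framework performed both the processing tasks and the resource management.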

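MapReduce 2, or next-generation MapReduce, was a long-awaited and much-needed upgrade to the scheduling, resource management, and execution techniques in Hadoop. Fundamentally, the improvements separate cluster resource management from MapReduce-specific logic. This separation of processing and resource management was achieved by introducing YARN in later versions of Hadoop.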

MRV1 uses OutputCollector and Reporter to communicate with the MapReduce system.

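The MRV2 API makes extensive use of context objects, which allow user code to communicate with the MapReduce system. (The roles of JobConf, OutputCollector, and Reporter from the old API are unified by the context objects in MRV2.)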
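To make the difference in shape concrete, here is a minimal, dependency-free sketch. It does not use Hadoop at all: Collector, StatusReporter, Ctx and Recorder are hypothetical stand-ins that only mimic how the old map method receives two helper objects while the new one receives a single context. The configuration role (JobConf) is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Dependency-free sketch: the nested interfaces are simplified stand-ins,
// NOT the real Hadoop types.
class ApiShapeSketch {

  // Models the old API: output and counters are two separate objects
  // (real counterparts: OutputCollector.collect and Reporter.incrCounter).
  interface Collector { void collect(String key, int value); }
  interface StatusReporter { void incrCounter(String group, String name, long amount); }

  // Models the new API: one object handles both output and counters
  // (real counterparts: Context.write and Context.getCounter(...).increment).
  interface Ctx {
    void write(String key, int value);
    void incrCounter(String group, String name, long amount);
  }

  // Old-style map: two collaborators are passed in.
  static void oldStyleMap(String line, Collector output, StatusReporter reporter) {
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      output.collect(itr.nextToken(), 1);
      reporter.incrCounter("WordCount", "TOKENS", 1);
    }
  }

  // New-style map: the same work goes through a single context.
  static void newStyleMap(String line, Ctx context) {
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      context.write(itr.nextToken(), 1);
      context.incrCounter("WordCount", "TOKENS", 1);
    }
  }

  // Hypothetical in-memory recorder implementing both shapes for comparison.
  static class Recorder implements Collector, StatusReporter, Ctx {
    final List<String> pairs = new ArrayList<>();
    long tokens = 0;

    @Override public void collect(String key, int value) { pairs.add(key + "=" + value); }
    @Override public void write(String key, int value) { pairs.add(key + "=" + value); }
    @Override public void incrCounter(String group, String name, long amount) { tokens += amount; }
  }

  public static void main(String[] args) {
    Recorder oldRun = new Recorder();
    oldStyleMap("to be or not", oldRun, oldRun);

    Recorder newRun = new Recorder();
    newStyleMap("to be or not", newRun);

    System.out.println(oldRun.pairs + " " + oldRun.tokens); // [to=1, be=1, or=1, not=1] 4
    System.out.println(newRun.pairs + " " + newRun.tokens); // [to=1, be=1, or=1, not=1] 4
  }
}
```

Both runs record the same pairs and the same counter value; only the method signature differs, which is why porting a mapper from OutputCollector to Context is mostly mechanical.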

You should use MapReduce 2 (MRV2). Here are the biggest advantages of Hadoop 2 over Hadoop 1:

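One major advantage is that there is no JobTracker and TaskTracker in the Hadoop 2 architecture. We have the YARN ResourceManager and NodeManager instead. This lets Hadoop 2 support models other than the MapReduce framework for executing code, and overcomes the high-latency problems associated with MapReduce.
Hadoop 2 supports non-batch processing as well as traditional batch operations.
HDFS federation was introduced in Hadoop 2. It allows multiple NameNodes to control the Hadoop cluster, which tries to address Hadoop's single point of failure problem.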

MRV2 has many other advantages as well. See https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/

Answer 2

This is a nice solution; however, I just use a one-line solution: int wordcount = string.split(" ").length - 1;
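As a quick check of that one-liner, here is a minimal plain-Java sketch, assuming the text is a single in-memory string with whitespace-separated words (so it is not a replacement for a distributed job). WordCountOneLiner and countWords are hypothetical names. Note that for single-space-separated input, split(" ").length is already the word count, so whether the "- 1" is appropriate depends on input the answer does not show.

```java
// Minimal sketch; WordCountOneLiner and countWords are hypothetical names.
class WordCountOneLiner {

  // Counts whitespace-separated words; returns 0 for blank input.
  static int countWords(String text) {
    String trimmed = text.trim();
    return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
  }

  public static void main(String[] args) {
    String s = "to be or not to be";
    System.out.println(s.split(" ").length);       // prints 6 (no subtraction needed here)
    System.out.println(countWords("  to   be  ")); // prints 2 (extra spaces are ignored)
  }
}
```

Splitting on the regex \\s+ after trim() avoids counting empty tokens when words are separated by several spaces or the string has leading whitespace.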

When should I use OutputCollector and Context in Hadoop?
