Question

https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html#method.summary

org.apache.hadoop.mapreduce.Mapper

run(Context) method

a). Expert users can override this method for more complete control over the execution of the Mapper.

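What is the default behavior of the run(Context) method as it stands?
What kind of special control would I get, according to the documentation, if I override run(Context)?
Has anyone overridden this method in their own implementations?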

Answer 1

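What is the default behavior of the run(Context) method as it stands?

The default implementation is visible in the Apache Hadoop source code for the Mapper class: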

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}

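To summarize:

Call setup for one-time initialization.
Iterate through all key-value pairs in the input.
Pass each key and value to the map method implementation.
Call cleanup for one-time teardown. Because cleanup sits in a finally block, it runs even if map throws.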

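What kind of special control would I get, according to the documentation, if I override run(Context)?

The default implementation always follows a specific sequence of execution in a single thread. Overriding it is rare, but doing so can open up possibilities for highly specialized implementations, such as a different threading model or an attempt to coalesce redundant key ranges.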
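
As a rough illustration of what "more complete control" can mean, the sketch below overrides run so that the loop stops after a fixed number of records. To stay runnable without Hadoop on the classpath it uses stand-in types: Context, SimpleMapper, LimitedMapper and trace are hypothetical names invented for this example, not Hadoop classes. Only the shape of run mirrors the default implementation quoted above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in types for illustration only; these are NOT Hadoop's classes.
public class RunOverrideSketch {

    /** Hypothetical, minimal substitute for Mapper.Context. */
    static class Context {
        private final Iterator<String> input;
        private String current;
        final List<String> log = new ArrayList<>();

        Context(List<String> records) {
            this.input = records.iterator();
        }

        boolean nextKeyValue() {
            if (!input.hasNext()) {
                return false;
            }
            current = input.next();
            return true;
        }

        String getCurrentValue() {
            return current;
        }
    }

    /** Mirrors the shape of the default run(Context) quoted in the answer. */
    static class SimpleMapper {
        void setup(Context c) { c.log.add("setup"); }
        void map(String value, Context c) { c.log.add("map:" + value); }
        void cleanup(Context c) { c.log.add("cleanup"); }

        void run(Context c) {
            setup(c);
            try {
                while (c.nextKeyValue()) {
                    map(c.getCurrentValue(), c);
                }
            } finally {
                cleanup(c);
            }
        }
    }

    /** Overrides run to stop after a fixed number of records. */
    static class LimitedMapper extends SimpleMapper {
        private final int limit;

        LimitedMapper(int limit) {
            this.limit = limit;
        }

        @Override
        void run(Context c) {
            setup(c);
            try {
                int seen = 0;
                // The limit is checked first, so no extra record is consumed.
                while (seen < limit && c.nextKeyValue()) {
                    map(c.getCurrentValue(), c);
                    seen++;
                }
            } finally {
                cleanup(c);
            }
        }
    }

    /** Runs a mapper over the records and returns the recorded call order. */
    static List<String> trace(SimpleMapper m, List<String> records) {
        Context c = new Context(records);
        m.run(c);
        return c.log;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "c");
        System.out.println(trace(new SimpleMapper(), records));
        System.out.println(trace(new LimitedMapper(2), records));
    }
}
```

In a real job the same idea would apply to a subclass of org.apache.hadoop.mapreduce.Mapper. Keeping setup and the try/finally around cleanup in the override preserves the lifecycle guarantees that the default run provides.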

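Has anyone overridden this method in their own implementations?

Within the Apache Hadoop codebase, there are two overrides of this method:

ChainMapper allows multiple Mapper class implementations to be chained together for execution within a single map task. Its override of run sets up an object representing the chain and passes each input key/value pair through that chain of mappers.
MultithreadedMapper allows multi-threaded execution of another Mapper class. That Mapper class must be thread-safe. Its override of run starts multiple threads that iterate over the input key-value pairs and pass them through the underlying Mapper.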
