Question

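My understanding of a reducer is that it processes one key-value pair at a time from the sorted and shuffled intermediate output file. I don't know how to access that intermediate file with the sorted and shuffled key-value pairs. Since I can't access the intermediate file, I can't write code in the reducer to pick the largest key. I don't know how to program a reducer, which receives one K,V pair at a time, so that it writes only the largest key and its corresponding value to the final output file.

Suppose this is the intermediate file from the mapper, after it has been sorted and shuffled: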

1 a

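2 is

4 this what

I want the reducer to write only "4 this what" to the final output file. Since the reducer does not hold the whole file in memory, it does not seem possible to write this logic in the reducer. I would like to know whether any API supports picking the last line of the intermediate file, which will end up holding the largest key (keys are sorted by default).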

OR

Do I have to override the default sort comparator to achieve what I want?

Answer 1

You can set a different comparator for sorting in your job:

job.setSortComparatorClass(LongWritable.DecreasingComparator.class);

For example, this sorts LongWritable keys in descending order, so the largest key reaches the reducer first.
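To illustrate the effect, here is a minimal plain-Java sketch with no Hadoop dependency. It only imitates what a decreasing sort comparator does to the sample data from the question. The class `DescendingSortSketch` and the method `firstAfterDescendingSort` are hypothetical names, and a `TreeMap` merely stands in for the framework's sort phase.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

final class DescendingSortSketch {
    // Returns the first entry when keys are ordered from largest to smallest,
    // i.e. the pair a reducer would see first under a decreasing comparator.
    static Map.Entry<Long, String> firstAfterDescendingSort(Map<Long, String> pairs) {
        TreeMap<Long, String> sorted = new TreeMap<>(Comparator.reverseOrder());
        sorted.putAll(pairs);
        return sorted.firstEntry(); // null when the input is empty
    }

    public static void main(String[] args) {
        Map<Long, String> pairs = Map.of(1L, "a", 2L, "is", 4L, "this what");
        Map.Entry<Long, String> first = firstAfterDescendingSort(pairs);
        System.out.println(first.getKey() + " " + first.getValue()); // prints "4 this what"
    }
}
```

With the order reversed, "pick the largest key" turns into "take the first key", which a reducer can do without seeing the whole file.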

Answer 2

A simple solution is to use a single Reducer (so that all key-value pairs go to it) and have it keep track of the largest key.

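// Assumes keys are non-negative; -1 means "nothing seen yet".
private final IntWritable currentMax = new IntWritable(-1);
private final List<Text> maxValues = new ArrayList<>();

@Override
public void reduce(IntWritable key, Iterable<Text> values, Context context) {
  if (key.compareTo(currentMax) > 0) {
    currentMax.set(key.get());
    // Hadoop reuses the Text instance while iterating, so copy each value.
    maxValues.clear();
    for (Text value : values) {
      maxValues.add(new Text(value));
    }
  }
}

@Override
public void cleanup(Context context) throws IOException, InterruptedException {
  // Emit the largest key with its saved values once all keys have been seen.
  for (Text value : maxValues) {
    context.write(currentMax, value);
  }
}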

As a further optimization, you can emit only the largest key from each Mapper in the same way, or use this Reducer implementation as the Combiner class.
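The optimization can be sketched in plain Java, without Hadoop. Each input split is reduced to its local maximum first (the role of the Mapper or Combiner), and the single Reducer then compares only those candidates. The class `LocalMaxSketch` and its methods are hypothetical names, and the sketch assumes one value per key for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LocalMaxSketch {
    // Imitates one mapper or combiner: keep only the pair with the largest key.
    static Map.Entry<Integer, String> localMax(List<Map.Entry<Integer, String>> split) {
        Map.Entry<Integer, String> best = null;
        for (Map.Entry<Integer, String> pair : split) {
            if (best == null || pair.getKey() > best.getKey()) {
                best = pair;
            }
        }
        return best; // null for an empty split
    }

    // Imitates the single reducer: it sees at most one candidate per split.
    static Map.Entry<Integer, String> globalMax(List<List<Map.Entry<Integer, String>>> splits) {
        List<Map.Entry<Integer, String>> candidates = new ArrayList<>();
        for (List<Map.Entry<Integer, String>> split : splits) {
            Map.Entry<Integer, String> candidate = localMax(split);
            if (candidate != null) {
                candidates.add(candidate);
            }
        }
        return localMax(candidates);
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> splitA = List.of(Map.entry(1, "a"), Map.entry(2, "is"));
        List<Map.Entry<Integer, String>> splitB = List.of(Map.entry(4, "this what"));
        Map.Entry<Integer, String> max = globalMax(List.of(splitA, splitB));
        System.out.println(max.getKey() + " " + max.getValue()); // prints "4 this what"
    }
}
```

The point of the design is that the shuffle then carries one pair per split instead of every pair, so the single Reducer stops being a bottleneck.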

Answer 3

Thanks to Thomas Jungblut for the better approach.

For your driver:

job.setSortComparatorClass(IntWritable.DecreasingComparator.class);

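For your reducer:

private boolean biggestKeyDone = false;

@Override
public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    if (!biggestKeyDone) {
        // With the decreasing comparator, the first key seen is the largest.
        for (Text value : values) {
            context.write(key, value);
        }
        biggestKeyDone = true;
    }
}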

Answer 4

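If you only want to write the values of the largest key in the reducer, I suggest saving the largest key detected in the mapper in the configuration. Like this: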

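private Integer currentMax = null;

@Override
public void map(IntWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Track the largest key seen by this mapper.
    if (currentMax == null) {
        currentMax = key.get();
    } else {
        currentMax = Math.max(currentMax, key.get());
    }
    context.write(key, value);
}

@Override
protected void cleanup(Context context) {
    if (currentMax != null) {
        // Note: this updates only this task's copy of the Configuration;
        // Hadoop does not send task-side changes on to the reduce tasks.
        context.getConfiguration().set("biggestKey", currentMax.toString());
    }
}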

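Then, in your reducer:

private int biggestKey = -1;

@Override
protected void setup(Context context) {
    // Falls back to -1 when the property is missing.
    biggestKey = context.getConfiguration().getInt("biggestKey", -1);
}

@Override
public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    if (biggestKey == key.get()) {
        // Write the values of the largest key straight through.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}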

This avoids wasting memory and the time spent copying values.
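The idea behind this answer can be sketched in plain Java, without Hadoop. If the largest key is known before the reduce step starts, the reducer becomes a filter that buffers nothing. The class `KnownMaxFilterSketch` and its methods are hypothetical names, and how the value actually reaches the reducers in a real job is left as an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

final class KnownMaxFilterSketch {
    // First pass: scan the keys once to find the largest one.
    static OptionalInt biggestKey(List<Map.Entry<Integer, String>> pairs) {
        return pairs.stream().mapToInt(Map.Entry::getKey).max();
    }

    // Second pass, standing in for the reducer: emit only pairs whose key matches.
    static List<String> emitForKey(List<Map.Entry<Integer, String>> pairs, int biggestKey) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<Integer, String> pair : pairs) {
            if (pair.getKey() == biggestKey) {
                output.add(pair.getKey() + " " + pair.getValue());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> pairs =
                List.of(Map.entry(1, "a"), Map.entry(2, "is"), Map.entry(4, "this what"));
        OptionalInt max = biggestKey(pairs);
        if (max.isPresent()) {
            System.out.println(emitForKey(pairs, max.getAsInt())); // prints "[4 this what]"
        }
    }
}
```

Compared with saving values in `cleanup`, the filter never holds more than the current pair, which is the memory saving the answer refers to.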

Picking the largest key in the reducer function