Does the default sort in MapReduce use the Comparator defined for the WritableComparable class, or its compareTo() method?

Asked: 2015-03-21 14:49:32

Tags: hadoop mapreduce

How does sorting happen in MapReduce before the output is passed from the mapper to the reducer? If my mapper output key is of type IntWritable, does it use the comparator defined for the IntWritable class, or the class's compareTo method? If so, how is that call made? If not, how is the sorting performed, and how is it invoked?

2 answers:

Answer 0 (score: 1)

The map task output is first collected, then sent to the Partitioner, which is responsible for determining which Reducer each record will be sent to (the data is not yet grouped into reduce() calls at this point). The default Partitioner uses the key's hashCode() method modulo the number of Reducers.
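
As a point of reference, the default partitioner boils down to something like this (a simplified sketch of Hadoop's HashPartitioner):

import org.apache.hadoop.mapreduce.Partitioner;

// Simplified sketch of the default HashPartitioner: the key's hashCode()
// is masked to a non-negative value, then taken modulo the reducer count.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}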

After that, the Comparator is called to sort the map output. The flow looks like this:

Collector -> Partitioner -> Spill -> Comparator -> Local disk (HDFS) <- MapOutputServlet

Each Reducer will then copy its data from the mappers the Partitioner assigned to it, and pass it to the Grouper, which determines how records are grouped for a single Reducer function call:

MapOutputServlet -> Copy to local disk (HDFS) -> Group -> Reduce

Before that function call, the records also go through a sorting phase to determine the order in which they arrive at the reducer. The Sorter (a WritableComparator) calls the compareTo() method of the key (from the WritableComparable interface).
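
To make that call chain concrete, here is the fallback path inside WritableComparator, paraphrased from the Hadoop source (key1, key2 and buffer are reusable fields of the comparator):

// Paraphrased from org.apache.hadoop.io.WritableComparator: when no raw
// (byte-level) comparator is registered for the key class, both serialized
// keys are deserialized and the key's own compareTo() decides the order.
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
        buffer.reset(b1, s1, l1);  // deserialize the first key
        key1.readFields(buffer);
        buffer.reset(b2, s2, l2);  // deserialize the second key
        key2.readFields(buffer);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return compare(key1, key2);    // ends up in key1.compareTo(key2)
}

public int compare(WritableComparable a, WritableComparable b) {
    return a.compareTo(b);
}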

To give you a better idea, here is how you would implement a basic compareTo(), grouper, and sorter for a custom composite key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {
    IntWritable primaryField = new IntWritable();
    IntWritable secondaryField = new IntWritable();

    // No-arg constructor required: Hadoop instantiates keys by reflection
    // before calling readFields()
    public CompositeKey() {
    }

    public CompositeKey(IntWritable p, IntWritable s) {
        this.primaryField.set(p.get());
        this.secondaryField.set(s.get());
    }

    public IntWritable getPrimaryField() {
        return this.primaryField;
    }

    public IntWritable getSecondaryField() {
        return this.secondaryField;
    }

    public void write(DataOutput out) throws IOException {
        this.primaryField.write(out);
        this.secondaryField.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        this.primaryField.readFields(in);
        this.secondaryField.readFields(in);
    }

    // Called by the partitioner to route map outputs to the same reducer instance.
    // If the hash source is simple (a primitive type or so), a simple call to its
    // hashCode() method is good enough.
    public int hashCode() {
        return this.primaryField.hashCode();
    }

    @Override
    public int compareTo(CompositeKey other) {
        if (this.getPrimaryField().equals(other.getPrimaryField())) {
            return this.getSecondaryField().compareTo(other.getSecondaryField());
        } else {
            return this.getPrimaryField().compareTo(other.getPrimaryField());
        }
    }
}


import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Grouping comparator: records whose primary fields are equal are grouped
// into the same reduce() call, regardless of the secondary field.
public class CompositeGroupingComparator extends WritableComparator {
    public CompositeGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.getPrimaryField().compareTo(second.getPrimaryField());
    }
}

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorting comparator: delegates to the full compareTo(), so records are
// sorted by primary field first, then by secondary field.
public class CompositeSortingComparator extends WritableComparator {
    public CompositeSortingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.compareTo(second);
    }
}
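
A minimal driver sketch showing where these classes would be plugged in (the surrounding job setup is assumed, not part of the original answer):

// Hypothetical wiring in the job driver: CompositeSortingComparator decides
// the sort order of map output keys, CompositeGroupingComparator decides
// which keys share a single reduce() call.
Job job = Job.getInstance(new Configuration(), "composite key job");
job.setMapOutputKeyClass(CompositeKey.class);
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);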

Answer 1 (score: 0)

The framework takes care of comparing all the default data types like IntWritable, DoubleWritable, etc. for us. But if you have a user-defined key type, you need to implement the WritableComparable interface.
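
This also answers the original question: for IntWritable the sort normally never reaches compareTo() at all, because the class registers a byte-level (raw) comparator that the framework uses instead. Paraphrased from the IntWritable source:

// Paraphrased from org.apache.hadoop.io.IntWritable: a raw comparator that
// compares the serialized bytes directly, with no deserialization.
public static class Comparator extends WritableComparator {
    public Comparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}

static {
    // Registered once, so the framework picks it up via WritableComparator.get()
    WritableComparator.define(IntWritable.class, new Comparator());
}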

WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.

Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.

Example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
   // Some data
   private int counter;
   private long timestamp;

   public void write(DataOutput out) throws IOException {
     out.writeInt(counter);
     out.writeLong(timestamp);
   }

   public void readFields(DataInput in) throws IOException {
     counter = in.readInt();
     timestamp = in.readLong();
   }

   public int compareTo(MyWritableComparable o) {
     int thisValue = this.counter;
     int thatValue = o.counter;
     return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }

   public int hashCode() {
     final int prime = 31;
     int result = 1;
     result = prime * result + counter;
     result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
     return result;
   }
 }

From: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html
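
If you want to check which comparator the framework will actually pick for a key class, a small hypothetical test like this works (WritableComparator.get() returns the registered raw comparator, or the generic compareTo()-based fallback if none was defined):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

public class ComparatorCheck {
    public static void main(String[] args) {
        WritableComparator cmp = WritableComparator.get(IntWritable.class);
        // Prints IntWritable's registered raw comparator class
        System.out.println(cmp.getClass().getName());
        // The object-level compare still delegates to compareTo(): prints -1
        System.out.println(cmp.compare(new IntWritable(1), new IntWritable(2)));
    }
}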