Does the default sort in MapReduce use the Comparator defined for the WritableComparable class, or its compareTo() method?

Asked: 2015-03-21 14:49:32

Tags: hadoop mapreduce

How does sorting happen in MapReduce before the output is passed from the mapper to the reducer? If my mapper output key is of type IntWritable, does it use the comparator defined for the IntWritable class, or the class's compareTo method? If so, how is that call made? If not, how is the sorting performed, and how is it invoked?

2 answers:

Answer 0 (score: 1)

The map task output is first collected, then sent to the Partitioner, which is responsible for determining which Reducer each record will be sent to (the data is not yet grouped into reduce() calls at this point). The default Partitioner uses the key's hashCode() method modulo the number of Reducers.
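
As a point of reference, the default partitioner boils down to something like this (a simplified sketch of Hadoop's HashPartitioner):

import org.apache.hadoop.mapreduce.Partitioner;

// Simplified sketch of the default HashPartitioner: the key's hashCode()
// is masked to a non-negative value, then taken modulo the reducer count.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}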

After that, the Comparator is called to sort the map output. The flow looks like this:

Collector -> Partitioner -> Spill -> Comparator -> Local disk (HDFS) <- MapOutputServlet

Each Reducer will then copy its data from the mappers the Partitioner assigned to it, and pass it to the Grouper, which determines how records are grouped for a single Reducer function call:

MapOutputServlet -> Copy to local disk (HDFS) -> Group -> Reduce

Before that function call, the records also go through a sorting phase to determine the order in which they arrive at the reducer. The Sorter (a WritableComparator) calls the compareTo() method of the key (from the WritableComparable interface).
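
To make that call chain concrete, here is the fallback path inside WritableComparator, paraphrased from the Hadoop source (key1, key2 and buffer are reusable fields of the comparator):

// Paraphrased from org.apache.hadoop.io.WritableComparator: when no raw
// (byte-level) comparator is registered for the key class, both serialized
// keys are deserialized and the key's own compareTo() decides the order.
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
        buffer.reset(b1, s1, l1);  // deserialize the first key
        key1.readFields(buffer);
        buffer.reset(b2, s2, l2);  // deserialize the second key
        key2.readFields(buffer);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return compare(key1, key2);    // ends up in key1.compareTo(key2)
}

public int compare(WritableComparable a, WritableComparable b) {
    return a.compareTo(b);
}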

To give you a better idea, here is how you would implement a basic compareTo(), grouper, and sorter for a custom composite key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {
    IntWritable primaryField = new IntWritable();
    IntWritable secondaryField = new IntWritable();

    // No-arg constructor required: Hadoop instantiates keys by reflection
    // before calling readFields()
    public CompositeKey() {
    }

    public CompositeKey(IntWritable p, IntWritable s) {
        this.primaryField.set(p.get());
        this.secondaryField.set(s.get());
    }

    public IntWritable getPrimaryField() {
        return this.primaryField;
    }

    public IntWritable getSecondaryField() {
        return this.secondaryField;
    }

    public void write(DataOutput out) throws IOException {
        this.primaryField.write(out);
        this.secondaryField.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        this.primaryField.readFields(in);
        this.secondaryField.readFields(in);
    }

    // Called by the partitioner to route map outputs to the same reducer instance.
    // If the hash source is simple (a primitive type or so), a simple call to its
    // hashCode() method is good enough.
    public int hashCode() {
        return this.primaryField.hashCode();
    }

    @Override
    public int compareTo(CompositeKey other) {
        if (this.getPrimaryField().equals(other.getPrimaryField())) {
            return this.getSecondaryField().compareTo(other.getSecondaryField());
        } else {
            return this.getPrimaryField().compareTo(other.getPrimaryField());
        }
    }
}


import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Grouping comparator: records whose primary fields are equal are grouped
// into the same reduce() call, regardless of the secondary field.
public class CompositeGroupingComparator extends WritableComparator {
    public CompositeGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.getPrimaryField().compareTo(second.getPrimaryField());
    }
}

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorting comparator: delegates to the full compareTo(), so records are
// sorted by primary field first, then by secondary field.
public class CompositeSortingComparator extends WritableComparator {
    public CompositeSortingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.compareTo(second);
    }
}
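
A minimal driver sketch showing where these classes would be plugged in (the surrounding job setup is assumed, not part of the original answer):

// Hypothetical wiring in the job driver: CompositeSortingComparator decides
// the sort order of map output keys, CompositeGroupingComparator decides
// which keys share a single reduce() call.
Job job = Job.getInstance(new Configuration(), "composite key job");
job.setMapOutputKeyClass(CompositeKey.class);
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);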

Answer 1 (score: 0)

The framework takes care of comparing all the default data types like IntWritable, DoubleWritable, etc. for us. But if you have a user-defined key type, you need to implement the WritableComparable interface.
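
This also answers the original question: for IntWritable the sort normally never reaches compareTo() at all, because the class registers a byte-level (raw) comparator that the framework uses instead. Paraphrased from the IntWritable source:

// Paraphrased from org.apache.hadoop.io.IntWritable: a raw comparator that
// compares the serialized bytes directly, with no deserialization.
public static class Comparator extends WritableComparator {
    public Comparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}

static {
    // Registered once, so the framework picks it up via WritableComparator.get()
    WritableComparator.define(IntWritable.class, new Comparator());
}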

WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.

Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.

Example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
   // Some data
   private int counter;
   private long timestamp;

   public void write(DataOutput out) throws IOException {
     out.writeInt(counter);
     out.writeLong(timestamp);
   }

   public void readFields(DataInput in) throws IOException {
     counter = in.readInt();
     timestamp = in.readLong();
   }

   public int compareTo(MyWritableComparable o) {
     int thisValue = this.counter;
     int thatValue = o.counter;
     return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }

   public int hashCode() {
     final int prime = 31;
     int result = 1;
     result = prime * result + counter;
     result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
     return result;
   }
 }

From: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html
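
If you want to check which comparator the framework will actually pick for a key class, a small hypothetical test like this works (WritableComparator.get() returns the registered raw comparator, or the generic compareTo()-based fallback if none was defined):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

public class ComparatorCheck {
    public static void main(String[] args) {
        WritableComparator cmp = WritableComparator.get(IntWritable.class);
        // Prints IntWritable's registered raw comparator class
        System.out.println(cmp.getClass().getName());
        // The object-level compare still delegates to compareTo(): prints -1
        System.out.println(cmp.compare(new IntWritable(1), new IntWritable(2)));
    }
}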