Question

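I have a small problem with a MapReduce job: when I use two different classes for the key, the job produces a different number of messages. The puzzling part is that, as far as MapReduce is concerned, the two classes should behave essentially the same.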

Case 1:

class Key1{
    private LongWritable id = new LongWritable();
    private LongWritable timestamp = new LongWritable();
}

With this key class, the MR job produces 9871431 results.

Case 2:

class Key1{
    private LongWritable id = new LongWritable();
    private LongWritable timestamp = new LongWritable();
    private LongWritable field1 = new LongWritable();
    .....
    private LongWritable fieldN = new LongWritable();
}

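The only difference is that the key has more fields. That alone could explain the different results, but my grouping comparator uses only the ID field: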

public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(Key1.class, true);
    }
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        // Cast both arguments to the same key type the comparator was registered with.
        Key1 p1 = (Key1) a;
        Key1 p2 = (Key1) b;

        return p1.getId().compareTo(p2.getId());
    }
}

The same goes for my partitioner:

public class MMSIPairPartitioner extends Partitioner<Key1, TrackingPair> {

    public int getPartition(Key1 key, TrackingPair value, int numPartitions) {
        return key.getId().hashCode() % numPartitions;
    }
}
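A side note on the partitioner above, as a minimal sketch rather than a diagnosis: in Java, `hash % numPartitions` is negative whenever the hash is negative, and a negative index is not a valid partition. The class and method names below (`PartitionSketch`, `partitionFor`) are hypothetical and use a plain `int` hash instead of the real key class, so the snippet runs without Hadoop.

```java
final class PartitionSketch {
    private PartitionSketch() {}

    /** Maps any int hash to a partition index in [0, numPartitions). */
    static int partitionFor(int hash, int numPartitions) {
        // Clear the sign bit first so the remainder can never be negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int negativeHash = -7;
        System.out.println(negativeHash % numPartitions);                // prints -3
        System.out.println(partitionFor(negativeHash, numPartitions));   // prints 1
    }
}
```

Masking the sign bit changes which partition a negative hash lands in, but equal ids still map to the same partition, which is the only property the grouping relies on.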

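So my expectation is that the number of documents produced with Key1 should be exactly the same as with Key2, since the same field is used to determine the partition in both cases.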

But Key1 produces 2 documents fewer.

The reduce part is the same in both cases.

Both key implementations have this compareTo method, since I am using a secondary sort:

public int compareTo(Key pair) {
    int compareValue = this.id.compareTo(pair.getId());
    if (compareValue == 0) {
        compareValue = this.timestamp.compareTo(pair.getTimestamp());
    }
    return compareValue;
}
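To make the interaction between the sort order and the grouping comparator concrete, here is a small simulation in plain Java. It is only a sketch under assumptions: `SimpleKey`, `SORT_ORDER`, `GROUP_ORDER` and `countGroups` are hypothetical stand-ins for the key class and comparators in this question, and the loop imitates the idea that a new reduce group starts whenever the grouping comparator sees a difference between neighbouring keys in sorted order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GroupingSketch {
    private GroupingSketch() {}

    /** Hypothetical stand-in for the composite key: only the fields used for sorting. */
    static final class SimpleKey {
        final long id;
        final long timestamp;

        SimpleKey(long id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    /** Sort order: id first, then timestamp (same idea as compareTo in the question). */
    static final Comparator<SimpleKey> SORT_ORDER =
            Comparator.comparingLong((SimpleKey k) -> k.id)
                      .thenComparingLong((SimpleKey k) -> k.timestamp);

    /** Grouping order: id only (same idea as the grouping comparator in the question). */
    static final Comparator<SimpleKey> GROUP_ORDER =
            Comparator.comparingLong((SimpleKey k) -> k.id);

    /** Sorts the keys, then counts how many runs of equal ids appear back to back. */
    static int countGroups(List<SimpleKey> keys, Comparator<SimpleKey> sortOrder) {
        List<SimpleKey> sorted = new ArrayList<>(keys);
        sorted.sort(sortOrder);
        int groups = 0;
        SimpleKey previous = null;
        for (SimpleKey key : sorted) {
            // Only neighbours are compared, so equal ids that are not adjacent
            // in the sorted order end up in separate groups.
            if (previous == null || GROUP_ORDER.compare(previous, key) != 0) {
                groups++;
            }
            previous = key;
        }
        return groups;
    }
}
```

With keys sorted by `SORT_ORDER`, the group count equals the number of distinct ids. If the effective sort order ever disagrees with the id ordering (for example, a comparator that looks at another field first), the same ids can be split across several groups and the output count changes.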

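So why does the key class with more fields (which are not used for sorting or partitioning) produce a different number of results (2 fewer in this case)?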
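One thing that might be worth checking, since the post does not show the serialization code: the grouping comparator is constructed with `super(Key1.class, true)`, so keys are deserialized through `readFields` before `compare` is called. If the wider key does not read every field in exactly the order `write` emits it, the deserialized `id` could differ from the original. The sketch below is a plain-Java round-trip check under that assumption; `WideKey` and `roundTrip` are hypothetical names, and only the `write(DataOutput)` / `readFields(DataInput)` shape is borrowed from Hadoop's `Writable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RoundTripSketch {
    private RoundTripSketch() {}

    /** Hypothetical key with one extra field beyond id and timestamp. */
    static final class WideKey {
        long id;
        long timestamp;
        long field1;

        void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeLong(timestamp);
            out.writeLong(field1);
        }

        void readFields(DataInput in) throws IOException {
            // Must read every field, in the same order write() emitted them.
            id = in.readLong();
            timestamp = in.readLong();
            field1 = in.readLong();
        }
    }

    /** Serializes the key to bytes and reads it back into a fresh instance. */
    static WideKey roundTrip(WideKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                key.write(out);
            }
            WideKey copy = new WideKey();
            try (DataInputStream in =
                         new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running the same kind of check against the real key class, with distinct values in every field, would show whether `id` and `timestamp` survive serialization unchanged.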

MapReduce: different keys produce a different number of messages

0 answers: