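Hadoop: secondary sort not working

Asked: 2014-03-21 15:41:45

Tags: java sorting hadoop mapreduce

I implemented an algorithm in Hadoop 1.2.1 whose reducer code relies on secondary sort. However, when I run it, one reducer receives sorted tuples but the other does not. I have spent a lot of time trying to find the cause, without success.

Does anyone know what the problem might be? I think it has to do with the secondary sort code.

Here is the code that implements the secondary sort:

Composite key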

    public class CompositeKey implements WritableComparable<CompositeKey>{
        public String key;
        public Integer position;
        @Override
        public void readFields(DataInput arg0) throws IOException {
            key = WritableUtils.readString(arg0);
            position = arg0.readInt();
        }
        @Override
        public void write(DataOutput arg0) throws IOException {
            WritableUtils.writeString(arg0, key);
            arg0.writeLong(position);
        }
        @Override
        public int compareTo(CompositeKey o) {
            int result = key.compareTo(o.key);
            if(0 == result) {
                result = position.compareTo(o.position);
            }
            return result;
        }
    }

KeyComparator

    public class CompositeKeyComparator extends WritableComparator {
        protected CompositeKeyComparator() {
            super(CompositeKey.class, true);
        }
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey k1 = (CompositeKey)w1;
            CompositeKey k2 = (CompositeKey)w2;

            int result = k1.key.compareTo(k2.key);
            if(0 == result) {
                result = -1* k1.position.compareTo(k2.position);
            }
            return result;
        }

    }

Grouping comparator

    public class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);
        }   
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey k1 = (CompositeKey)w1;
            CompositeKey k2 = (CompositeKey)w2;

            return k1.key.compareTo(k2.key);
        }

    }

Partitioner

    public class NaturalKeyPartitioner extends Partitioner<CompositeKey, ReduceValue> {
        @Override
        public int getPartition(CompositeKey key, ReduceValue val, int numPartitions) {
            int hash = key.key.hashCode();
            // Note: % binds tighter than &, so this evaluates as
            // hash & (Integer.MAX_VALUE % numPartitions).
            int partition = hash & Integer.MAX_VALUE % numPartitions;
            return partition;
        }
    }

Job configuration

    //secondary sort
    job.setPartitionerClass(NaturalKeyPartitioner.class);
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    job.setSortComparatorClass(CompositeKeyComparator.class);

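Whether I run this in a pseudo-distributed environment or on a cluster, one reducer receives sorted tuples and the other does not. For example, here is an excerpt showing the tuples received by the two reducers (the first column is the primary key, the second is the secondary key):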

    First reducer:
    a1 0 
    a1 1 
    a1 11 
    a1 16 
    a1 27 
    a1 28 
    a1 34 
    a1 35 
    a1 37 
    a1 38 
    a1 43 
    a1 44 
    a1 46 
    a1 48 
    a1 50 
    a1 54 
    a1 55 
    a1 56 
    a1 57 
    a1 60 
    a1 61 
    a1 63 
    a1 64 
    a1 66 
    a1 69 
    a1 70 
    a1 72 
    a1 75 
    a1 76 
    a1 78 
    a1 79 
    a1 80 
    a1 84 
    a1 85 
    a1 86 
    a1 87 
    a1 88 
    a1 91 
    a1 92 
    a1 97 
    a1 102   
    a1 106    
    a1 108  
    a1 109 
    a1 110 
    a1 111 
    a1 116     
    a1 118  
    a1 119 
    a1 120  

    Second reducer:
    a2 87 
    a2 115
    a2 65 
    a2 90 
    a2 68 
    a2 119    
    a2 91 
    a2 0 
    a2 70 
    a2 3 
    a2 8 
    a2 9 
    a2 10 
    a2 71 
    a2 110   
    a2 16 
    a2 17 
    a2 20 
    a2 21 
    a2 23 
    a2 26 
    a2 72 
    a2 27 
    a2 94 
    a2 29 
    a2 30 
    a2 31 
    a2 75 
    a2 95 
    a2 36 
    a2 76 
    a2 117  
    a2 39 
    a2 40 
    a2 41 
    a2 42 
    a2 97 
    a2 79 
    a2 44 
    a2 45 
    a2 98 
    a2 46 
    a2 80 
    a2 49 
    a2 82 
    a2 50 
    a2 83 
    a2 100 
    a2 84 
    a2 112     
    a2 57 
    a2 59 
    a2 113      
    a2 60 
    a2 114       
    a2 61 

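1 answer:

Answer 0 (score: 1)

I think this is because, in CompositeKey's serialization/deserialization logic, you write position as a long but read it back as an int. That breaks the comparison logic, because the value being compared is not the same value you wrote to the context.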
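
To make the mismatch concrete, here is a minimal sketch that reproduces it with plain `java.io` streams and no Hadoop on the classpath. The class `RoundTripDemo` and its methods are hypothetical names for illustration, and `writeUTF`/`readUTF` stand in for `WritableUtils.writeString`/`readString` (an assumption; the exact string encoding does not matter for the point being shown).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RoundTripDemo {

    // Serializes (key, position). With writeAsLong == true this mirrors the
    // question's write(); with false it matches what readFields() expects.
    static byte[] serialize(String key, int position, boolean writeAsLong) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(key); // stand-in for WritableUtils.writeString (assumption)
            if (writeAsLong) {
                out.writeLong(position); // 8 bytes, big-endian
            } else {
                out.writeInt(position);  // 4 bytes, big-endian
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the position back with readInt(), as the question's readFields() does.
    static int readPosition(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            in.readUTF(); // skip the key
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Mismatched: readInt() sees the high 4 bytes of the long, which are zero.
        System.out.println(readPosition(serialize("a2", 87, true)));
        // Symmetric: the position survives the round trip.
        System.out.println(readPosition(serialize("a2", 87, false)));
    }
}
```

In this sketch every non-negative position written as a long reads back as 0, so a comparator working on deserialized keys would see all positions within a key as equal. The likely fix in `CompositeKey.write` is to replace `arg0.writeLong(position)` with `arg0.writeInt(position)` so that it mirrors `readFields`.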