我在Hadoop 1.2.1中实现了一个算法,其中reducer代码依赖于二级排序。但是,当我运行算法时,一个reducer接收已排序的元组,但另一个不接受。我花了很多时间试图找出原因,但没有任何成功。
有谁知道可能是什么问题?我认为它与次要排序代码有关。
以下是实现二级排序的代码:
public class CompositeKey implements WritableComparable<CompositeKey>{
public String key;
public Integer position;
@Override
public void readFields(DataInput arg0) throws IOException {
key = WritableUtils.readString(arg0);
position = arg0.readInt();
}
@Override
public void write(DataOutput arg0) throws IOException {
WritableUtils.writeString(arg0, key);
arg0.writeLong(position);
}
@Override
public int compareTo(CompositeKey o) {
int result = key.compareTo(o.key);
if(0 == result) {
result = position.compareTo(o.position);
}
return result;
}
}
public class CompositeKeyComparator extends WritableComparator {
protected CompositeKeyComparator() {
super(CompositeKey.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
CompositeKey k1 = (CompositeKey)w1;
CompositeKey k2 = (CompositeKey)w2;
int result = k1.key.compareTo(k2.key);
if(0 == result) {
result = -1* k1.position.compareTo(k2.position);
}
return result;
}
}
public class NaturalKeyGroupingComparator extends WritableComparator {
protected NaturalKeyGroupingComparator() {
super(CompositeKey.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
CompositeKey k1 = (CompositeKey)w1;
CompositeKey k2 = (CompositeKey)w2;
return k1.key.compareTo(k2.key);
}
}
public class NaturalKeyPartitioner extends Partitioner<CompositeKey, ReduceValue> {
@Override
public int getPartition(CompositeKey key, ReduceValue val, int numPartitions) {
int hash = key.key.hashCode();
int partition = hash & Integer.MAX_VALUE % numPartitions;
return partition;
}
//secondary sort
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
如果我在伪分布式环境或集群上执行此操作,我会注意到一个reducer会获得排序元组,而另一个则不会。例如,这里有一个摘录,显示两个Reducer收到的元组(第一列是主要的,第二列是次要的):
First reducer:
a1 0
a1 1
a1 11
a1 16
a1 27
a1 28
a1 34
a1 35
a1 37
a1 38
a1 43
a1 44
a1 46
a1 48
a1 50
a1 54
a1 55
a1 56
a1 57
a1 60
a1 61
a1 63
a1 64
a1 66
a1 69
a1 70
a1 72
a1 75
a1 76
a1 78
a1 79
a1 80
a1 84
a1 85
a1 86
a1 87
a1 88
a1 91
a1 92
a1 97
a1 102
a1 106
a1 108
a1 109
a1 110
a1 111
a1 116
a1 118
a1 119
a1 120
Second reducer:
a2 87
a2 115
a2 65
a2 90
a2 68
a2 119
a2 91
a2 0
a2 70
a2 3
a2 8
a2 9
a2 10
a2 71
a2 110
a2 16
a2 17
a2 20
a2 21
a2 23
a2 26
a2 72
a2 27
a2 94
a2 29
a2 30
a2 31
a2 75
a2 95
a2 36
a2 76
a2 117
a2 39
a2 40
a2 41
a2 42
a2 97
a2 79
a2 44
a2 45
a2 98
a2 46
a2 80
a2 49
a2 82
a2 50
a2 83
a2 100
a2 84
a2 112
a2 57
a2 59
a2 113
a2 60
a2 114
a2 61
答案 0 :(得分:1)
我认为这是因为在CompositeKey的序列化/反序列化逻辑中,您将位置写为long,但将其作为整数读取。这会弄乱比较逻辑,因为你没有测试你写入上下文的完全相同的东西。