Strange error in a Hadoop reducer

Date: 2013-12-19 23:26:27

Tags: hadoop mapreduce

The reducer in my MapReduce job is as follows:

    public static class Reduce_Phase2 extends MapReduceBase
            implements Reducer<IntWritable, Neighbourhood, Text, Text> {

        public void reduce(IntWritable key, Iterator<Neighbourhood> values,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

            ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();

            while (values.hasNext()) {
                Neighbourhood n = values.next();
                cachedValues.add(n);
                // correct output
                //output.collect(new Text(n.source), new Text(n.neighbours));
            }

            for (Neighbourhood node : cachedValues) {
                // wrong output
                output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
            }
        }
    }

The Neighbourhood class has two fields, source and neighbours, both of type Text. The reducer receives a key with 19 values (of type Neighbourhood). When I output source and neighbours inside the while loop, I get the actual 19 distinct values. But if I output them after the while loop, as in the code shown, I get 19 identical values; that is, a single object is output 19 times! Whatever is happening here, it makes no sense to me. Any thoughts on this?

Here is the code for the Neighbourhood class:

    public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {

        Text source;
        Text neighbours;

        public Neighbourhood() {
            source = new Text();
            neighbours = new Text();
        }

        public Neighbourhood(String s, String n) {
            source = new Text(s);
            neighbours = new Text(n);
        }

        @Override
        public void readFields(DataInput arg0) throws IOException {
            source.readFields(arg0);
            neighbours.readFields(arg0);
        }

        @Override
        public void write(DataOutput arg0) throws IOException {
            source.write(arg0);
            neighbours.write(arg0);
        }

        @Override
        public int compareTo(Neighbourhood o) {
            return 0;
        }
    }

1 Answer:

Answer 0 (score: 4):

You're being hit by an efficiency mechanism employed by Hadoop: object reuse.

Each call to values.next() returns the same object reference; all Hadoop does behind the scenes is replace the contents of that single object with the underlying bytes of the next value (deserializing them via the readFields() method). As a result, every element of cachedValues is a reference to one and the same object, which ends up holding whatever value was read last.
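
To see the effect in isolation, here is a minimal plain-Java sketch (no Hadoop types; a StringBuilder stands in for the reused Writable) of what caching the iterator's object amounts to:

    import java.util.ArrayList;
    import java.util.List;

    public class ObjectReuseDemo {
        public static void main(String[] args) {
            // One mutable object, reused for every value -- analogous to the
            // single instance Hadoop's value iterator refills via readFields().
            StringBuilder reused = new StringBuilder();
            List<StringBuilder> cached = new ArrayList<StringBuilder>();

            for (int i = 0; i < 3; i++) {
                reused.setLength(0);           // overwrite the contents in place
                reused.append("value-" + i);
                cached.add(reused);            // caches the same reference each time
            }

            // Prints "value-2" three times: every list entry is the same object.
            for (StringBuilder sb : cached) {
                System.out.println(sb);
            }
        }
    }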

To avoid this, you need to create a deep copy of the object returned from values.next(). Hadoop actually has a utility method to do this for you, ReflectionUtils.copy. A simple fix looks like this:

    while (values.hasNext()) {
        // Create a fresh instance and deep-copy the current value into it.
        // Note ReflectionUtils.copy takes the Configuration first: copy(conf, src, dst).
        Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
        ReflectionUtils.copy(conf, values.next(), n);
        cachedValues.add(n);
    }
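
A slightly more compact alternative (assuming your Hadoop version ships org.apache.hadoop.io.WritableUtils, whose clone helper creates a new instance and copies the fields of a Writable) would be:

    while (values.hasNext()) {
        // WritableUtils.clone(orig, conf) returns a fresh, independent copy
        cachedValues.add(WritableUtils.clone(values.next(), conf));
    }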

You'll need to cache a reference to the job configuration (conf in the code above), which you can obtain by overriding the configure(JobConf) method in your Reducer:

    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        // configure is declared public in JobConfigurable, so the
        // override must be public as well
        conf = job;
    }
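
Putting it all together, the relevant parts of the reducer would look something like this (a sketch assembled from the question's code and the fix above):

    public static class Reduce_Phase2 extends MapReduceBase
            implements Reducer<IntWritable, Neighbourhood, Text, Text> {

        private JobConf conf;  // cached for ReflectionUtils

        @Override
        public void configure(JobConf job) {
            conf = job;
        }

        public void reduce(IntWritable key, Iterator<Neighbourhood> values,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
            while (values.hasNext()) {
                // Deep-copy each value before caching it, so each list entry
                // is a distinct object rather than the reused iterator instance
                Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
                ReflectionUtils.copy(conf, values.next(), n);
                cachedValues.add(n);
            }
            for (Neighbourhood node : cachedValues) {
                output.collect(new Text(key.toString()),
                        new Text(node.source + "\t\t" + node.neighbours));
            }
        }
    }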

A word of warning: accumulating a list of values like this is often the cause of memory problems in a job, particularly if you have a large number of values for a single key (say, upwards of 100,000).