Strange error in a Hadoop reducer

Date: 2013-12-19 23:26:27

Tags: hadoop mapreduce

The reducer in my MapReduce job is as follows:

    public static class Reduce_Phase2 extends MapReduceBase
            implements Reducer<IntWritable, Neighbourhood, Text, Text> {

        public void reduce(IntWritable key, Iterator<Neighbourhood> values,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

            ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();

            while (values.hasNext()) {
                Neighbourhood n = values.next();
                cachedValues.add(n);
                // correct output
                //output.collect(new Text(n.source), new Text(n.neighbours));
            }

            for (Neighbourhood node : cachedValues) {
                // wrong output
                output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
            }
        }
    }

The Neighbourhood class has two fields, source and neighbours, both of type Text. The reducer receives a key with 19 values (of type Neighbourhood). When I output source and neighbours inside the while loop, I get the actual 19 distinct values. But if I output them after the while loop, as in the code shown, I get 19 identical values; that is, a single object is output 19 times! Whatever is happening here, it makes no sense to me. Any thoughts on this?

Here is the code for the Neighbourhood class:

    public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {

        Text source;
        Text neighbours;

        public Neighbourhood() {
            source = new Text();
            neighbours = new Text();
        }

        public Neighbourhood(String s, String n) {
            source = new Text(s);
            neighbours = new Text(n);
        }

        @Override
        public void readFields(DataInput arg0) throws IOException {
            source.readFields(arg0);
            neighbours.readFields(arg0);
        }

        @Override
        public void write(DataOutput arg0) throws IOException {
            source.write(arg0);
            neighbours.write(arg0);
        }

        @Override
        public int compareTo(Neighbourhood o) {
            return 0;
        }
    }

1 Answer:

Answer 0 (score: 4):

You're being hit by an efficiency mechanism employed by Hadoop: object reuse.

Each call to values.next() returns the same object reference; all Hadoop does behind the scenes is replace the contents of that single object with the underlying bytes of the next value (deserializing them via the readFields() method). As a result, every element of cachedValues is a reference to one and the same object, which ends up holding whatever value was read last.
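
To see the effect in isolation, here is a minimal plain-Java sketch (no Hadoop types; a StringBuilder stands in for the reused Writable) of what caching the iterator's object amounts to:

    import java.util.ArrayList;
    import java.util.List;

    public class ObjectReuseDemo {
        public static void main(String[] args) {
            // One mutable object, reused for every value -- analogous to the
            // single instance Hadoop's value iterator refills via readFields().
            StringBuilder reused = new StringBuilder();
            List<StringBuilder> cached = new ArrayList<StringBuilder>();

            for (int i = 0; i < 3; i++) {
                reused.setLength(0);           // overwrite the contents in place
                reused.append("value-" + i);
                cached.add(reused);            // caches the same reference each time
            }

            // Prints "value-2" three times: every list entry is the same object.
            for (StringBuilder sb : cached) {
                System.out.println(sb);
            }
        }
    }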

To avoid this, you need to create a deep copy of the object returned from values.next(). Hadoop actually has a utility method to do this for you, ReflectionUtils.copy. A simple fix looks like this:

    while (values.hasNext()) {
        // Create a fresh instance and deep-copy the current value into it.
        // Note ReflectionUtils.copy takes the Configuration first: copy(conf, src, dst).
        Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
        ReflectionUtils.copy(conf, values.next(), n);
        cachedValues.add(n);
    }
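
A slightly more compact alternative (assuming your Hadoop version ships org.apache.hadoop.io.WritableUtils, whose clone helper creates a new instance and copies the fields of a Writable) would be:

    while (values.hasNext()) {
        // WritableUtils.clone(orig, conf) returns a fresh, independent copy
        cachedValues.add(WritableUtils.clone(values.next(), conf));
    }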

You'll need to cache a reference to the job configuration (conf in the code above), which you can obtain by overriding the configure(JobConf) method in your Reducer:

    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        // configure is declared public in JobConfigurable, so the
        // override must be public as well
        conf = job;
    }
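
Putting it all together, the relevant parts of the reducer would look something like this (a sketch assembled from the question's code and the fix above):

    public static class Reduce_Phase2 extends MapReduceBase
            implements Reducer<IntWritable, Neighbourhood, Text, Text> {

        private JobConf conf;  // cached for ReflectionUtils

        @Override
        public void configure(JobConf job) {
            conf = job;
        }

        public void reduce(IntWritable key, Iterator<Neighbourhood> values,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
            while (values.hasNext()) {
                // Deep-copy each value before caching it, so each list entry
                // is a distinct object rather than the reused iterator instance
                Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
                ReflectionUtils.copy(conf, values.next(), n);
                cachedValues.add(n);
            }
            for (Neighbourhood node : cachedValues) {
                output.collect(new Text(key.toString()),
                        new Text(node.source + "\t\t" + node.neighbours));
            }
        }
    }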

A word of warning: accumulating a list of values like this is often the cause of memory problems in a job, particularly if you have a large number of values for a single key (say, upwards of 100,000).