Reducer Iterable values appear inconsistent in Java MapReduce

Asked: 2017-02-19 03:30:02

Tags: java hadoop mapreduce iterator

I have the following code in my reduce function. When I try to create a shallow copy with CollectionUtils.addAll, the copy doesn't work: every item ends up referencing the last item from the iterator instead of the distinct values.

This is the code in my Reducer:

public void reduce(Text key, Iterable<ArrayListWritable<Writable>> values, Context context)
    throws IOException, InterruptedException {
    ArrayList<ArrayListWritable<Writable>> listOfWordPairs = new ArrayList<ArrayListWritable<Writable>>();

    // CollectionUtils.addAll(listOfWordPairs, values.iterator());
    // listOfWordPairs entries all seem to be the last item in the iterator

    Iterator<ArrayListWritable<Writable>> iter = values.iterator();

    // Manually do the copy
    while (iter.hasNext()) {
        // listOfWordPairs.add(iter.next());
        // Same behaviour as CollectionUtils.addAll()

        // Only working way to do it -> deep copy :(
        listOfWordPairs.add(new ArrayListWritable<Writable>(iter.next()));
    }
}

Does anyone know why this happens? I can see that implementing it this way would save MR a fair amount of memory, but there seems to be some magic making it work. I'm new to MR, so hopefully the question isn't too silly...

Here is the MAP code, for anyone interested:

@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, HMapStFW> stripes = new HashMap<>();

  List<String> tokens = Tokenizer.tokenize(value.toString());

  if (tokens.size() < 2) return;
  context.getCounter(StripesPmiEnums.TOTALENTRIES).increment(tokens.size());

  for (int i = 0; i < tokens.size() && i < 40; i++) {
    for (int j = 0; j < tokens.size() && j < 40; j++) {
      if (j == i)
        continue;
      // Make stripe if it doesn't exist
      if (!stripes.containsKey(tokens.get(i))) {
        HMapStFW newStripe = new HMapStFW();
        stripes.put(tokens.get(i), newStripe);
      }

      HMapStFW stripe = stripes.get(tokens.get(i));
      if (stripe.containsKey(tokens.get(j))) {
        stripe.put(tokens.get(j), stripe.get(tokens.get(j)) + 1.0f);
      } else {
        stripe.put(tokens.get(j), 1.0f);
      }
    }
  }

  for (String word1 : stripes.keySet()) {
    TEXT.set(word1);
    context.write(TEXT, stripes.get(word1));
  }
}

ArrayListWritable is also available here: https://github.com/lintool/tools/blob/master/lintools-datatypes/src/main/java/tl/lin/data/array/ArrayListWritable.java
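As background on what the map code above emits: here is a plain-Java sketch of the "stripes" the nested loops build, using ordinary `HashMap`s in place of `HMapStFW`. The `StripesDemo` class and `buildStripes` helper are illustrative names, not part of the original code; the loop logic mirrors the map method.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StripesDemo {
    // Builds co-occurrence "stripes": for each token, a map from
    // every other token on the line to how often the pair occurs.
    static Map<String, Map<String, Float>> buildStripes(List<String> tokens) {
        Map<String, Map<String, Float>> stripes = new HashMap<>();
        for (int i = 0; i < tokens.size() && i < 40; i++) {
            for (int j = 0; j < tokens.size() && j < 40; j++) {
                if (j == i) continue;
                Map<String, Float> stripe =
                        stripes.computeIfAbsent(tokens.get(i), k -> new HashMap<>());
                stripe.merge(tokens.get(j), 1.0f, Float::sum);
            }
        }
        return stripes;
    }

    public static void main(String[] args) {
        // "the" occurs twice, so its stripe counts "quick" once per occurrence,
        // and the two "the" positions also pair with each other.
        System.out.println(buildStripes(List.of("the", "quick", "the")));
    }
}
```

Each stripe is then written once per key in the final loop, which is what arrives at the reducer as an `ArrayListWritable` of values per word.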

1 Answer:

Answer 0 (score: 0):

This happens because the iterator works differently inside a reducer. In short, you must clone your objects while iterating over the iterator:
while (iter.hasNext()) {
    // this is correct
    listOfWordPairs.add(new ArrayListWritable<Writable>(iter.next()));
}
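To see why the shallow copy fails, here is a minimal, Hadoop-free sketch of the object-reuse pattern. The `reusingValues` iterator below is a hypothetical stand-in for Hadoop's real reducer value iterator: it hands back the same list instance on every `next()` call and only refills its contents, which is how Hadoop avoids allocating a fresh Writable per value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReusePitfallDemo {
    // Mimics Hadoop's value iterator: next() deserializes each record
    // into the SAME list instance instead of allocating a new one.
    static Iterable<ArrayList<String>> reusingValues(List<List<String>> records) {
        return () -> new Iterator<ArrayList<String>>() {
            private final ArrayList<String> reused = new ArrayList<>();
            private int pos = 0;
            public boolean hasNext() { return pos < records.size(); }
            public ArrayList<String> next() {
                reused.clear();                    // same object, new contents
                reused.addAll(records.get(pos++));
                return reused;
            }
        };
    }

    // Stores the references as-is, like CollectionUtils.addAll would.
    static List<ArrayList<String>> shallowCopy(Iterable<ArrayList<String>> values) {
        List<ArrayList<String>> out = new ArrayList<>();
        for (ArrayList<String> v : values) out.add(v);
        return out;
    }

    // Copies the current contents of each value, like the deep copy in the question.
    static List<ArrayList<String>> deepCopy(Iterable<ArrayList<String>> values) {
        List<ArrayList<String>> out = new ArrayList<>();
        for (ArrayList<String> v : values) out.add(new ArrayList<>(v));
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> records =
                List.of(List.of("a", "b"), List.of("c", "d"));
        // prints [[c, d], [c, d]] -- every entry is the one reused object
        System.out.println(shallowCopy(reusingValues(records)));
        // prints [[a, b], [c, d]] -- the copies survive
        System.out.println(deepCopy(reusingValues(records)));
    }
}
```

This is exactly why `listOfWordPairs` appeared to contain only the last value: each `add(iter.next())` stored another reference to the one reused object.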

Take a look at the following link; it explains this well:

https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/