Question

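I find this discrepancy puzzling.

Initially, I wanted to count the records in the reducer by adding 1 on each iteration of the loop. The input pairs are <Text, DoubleWritable>, every record has the same key "1", and there are 160000 records.

The code is as follows: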

public void reduce(Text key, Iterator<DoubleWritable> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double count = 0;   
    while(values.hasNext()){
        count = count + 1;
    }
    output.collect(new Text("Count"), new DoubleWritable(count));
}

The output is 22.

I then changed the reducer's input to <Text, Text>, where every record has the same key "1" and the value "1".

The code becomes:

public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double count = 0;
    String s = "";  
    while(values.hasNext()){
        s = values.next().toString();
        count = count + Integer.parseInt(s);
    }

    output.collect(new Text("Count"), new DoubleWritable(count));
}

Now the answer is correct: 160000.

It seems that the while loop should run the same number of times in both cases. Why are the results different?

Answer 1

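The problem here is that the logic of your two examples is actually different. In the first case you only count the number of key/value pairs passed to the reducer; in the second you add up their values.

To make the two logically equivalent, you need to change count = count + 1 to count = count + values.next().get() so that you get the sum of the values.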
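As a minimal sketch of that difference, the plain-Java snippet below (no Hadoop dependency) runs both loop bodies over the same iterator contents. The class name `CountVersusSum`, the helper names, and the three sample values are illustrative assumptions, and `Double` stands in for `DoubleWritable`, so `values.next()` here plays the role of `values.next().get()`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CountVersusSum {

    // Mirrors the first reducer: adds 1 per pair and ignores the value itself.
    static double countPairs(Iterator<Double> values) {
        double count = 0;
        while (values.hasNext()) {
            values.next();          // advance the iterator, discard the value
            count = count + 1;
        }
        return count;
    }

    // Mirrors the suggested change: adds the value carried by each pair.
    static double sumValues(Iterator<Double> values) {
        double count = 0;
        while (values.hasNext()) {
            count = count + values.next();
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical values; each one could itself be a partial total.
        List<Double> values = Arrays.asList(3.0, 4.0, 5.0);
        System.out.println("pairs = " + countPairs(values.iterator())); // pairs = 3.0
        System.out.println("sum   = " + sumValues(values.iterator()));  // sum   = 12.0
    }
}
```

The two methods agree only when every value is exactly 1, which is why the distinction matters as soon as the values reaching the reducer are no longer the original records.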

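The results differ because your reducer is also being used as the combiner. By the time the key/value pairs reach the reducer, they have already been partially summed (combined), so the reducer's input looks something like this: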

Count 1700
Count 42
Count 5640
...
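The effect of reusing the reducer as a combiner can be reproduced without a cluster. The sketch below is plain Java, not Hadoop code: it models a "reducer" as a function from a list of values to a single value, applies it once per map-side group (the combine step), and then applies it again to the partial results (the reduce step). The 22 groups and their near-equal sizes are assumptions chosen to match the output reported in the question; the real split sizes are unknown. All class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class CombinerSimulation {

    // First reducer from the question: one per pair, value ignored.
    static final Function<List<Double>, Double> COUNT_PAIRS =
            values -> (double) values.size();

    // Summing reducer: adds up the values it receives.
    static final Function<List<Double>, Double> SUM_VALUES = values -> {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    };

    // Applies the same function as combiner (per group) and then as reducer.
    static double run(List<List<Double>> groups,
                      Function<List<Double>, Double> reducer) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> group : groups) {
            partials.add(reducer.apply(group));   // combine step
        }
        return reducer.apply(partials);           // reduce step
    }

    // Splits `records` values of 1.0 into `groupCount` groups (assumed layout).
    static List<List<Double>> makeGroups(int records, int groupCount) {
        List<List<Double>> groups = new ArrayList<>();
        int base = records / groupCount;
        int remainder = records % groupCount;
        for (int i = 0; i < groupCount; i++) {
            int size = base + (i < remainder ? 1 : 0);
            groups.add(Collections.nCopies(size, 1.0));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Double>> groups = makeGroups(160000, 22);
        System.out.println("counting: " + run(groups, COUNT_PAIRS)); // counting: 22.0
        System.out.println("summing:  " + run(groups, SUM_VALUES));  // summing:  160000.0
    }
}
```

Under this model the counting version returns the number of partial results rather than the number of records, while the summing version returns 160000 however the records are divided. If a true pair count is wanted, another option is to not register the class as a combiner at all (in the old `mapred` API that registration is the `JobConf.setCombinerClass` call, assuming the job uses it).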
