Question

I did an exercise in Hadoop that sorts objects of type 'IntPair', which is a combination of 2 integers. Here is the input file:

2,9
3,8
2,6
3,2
...

The class 'IntPair' looks like this:

static class IntPair implements WritableComparable<IntPair> {
    private int first;
    private int second;
    ...
    public int compareTo(IntPair o) {
        // Order by first, then by second, both ascending.
        int cmp = compare(this.first, o.first);
        return (cmp != 0) ? cmp : compare(this.second, o.second);
    }

    public static int compare(int a, int b) {
        return (a == b) ? 0 : ((a > b) ? 1 : -1);
    }
    ...
}

In the Mapper, I use the inputFormat and output key/value types, and simply create an IntPair instance from the 2 integers on each line:

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
            String v[] = value.toString().split(",");
            IntPair k = new IntPair(Integer.parseInt(v[0]), Integer.parseInt(v[1]));
            context.write(k, NullWritable.get());

        }

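I partition the mapper output by the first integer, and the group comparator is also based on the first integer. Only the sort comparator is based on both integers.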

static class FirstPartitioner extends Partitioner<IntPair, NullWritable> {

    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
            return Math.abs(key.getFirst()*127)%numPartitions;
        }
}
static class BothComparator extends WritableComparator {
    public int compare(WritableComparable w1, WritableComparable w2) {
            IntPair p1 = (IntPair)w1;
            IntPair p2 = (IntPair)w2;
            int cmp = IntPair.compare(p1.getFirst(), p2.getFirst());
            if(cmp != 0) {
                return cmp;
            }
            return -IntPair.compare(p1.getSecond(), p2.getSecond());//reverse sort
    }

}

static class FirstGroupComparator extends WritableComparator {
    public int compare(WritableComparable w1, WritableComparable w2) {
            IntPair p1 = (IntPair)w1;
            IntPair p2 = (IntPair)w2;
            return IntPair.compare(p1.getFirst(), p2.getFirst());
    }
}
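To make the roles of the two comparators concrete, here is a minimal plain-Java sketch that involves no Hadoop classes. It assumes that `int[]` pairs can stand in for IntPair; the class name `SortVsGroupSketch` and its helper methods are hypothetical and only illustrate the idea: the sort comparator fixes the order of the keys, and the group comparator decides where one group ends and the next begins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone sketch; int[]{first, second} stands in for IntPair.
class SortVsGroupSketch {
    // First ascending, then second descending (mirrors BothComparator).
    static final Comparator<int[]> SORT = (a, b) -> {
        int cmp = Integer.compare(a[0], b[0]);
        return cmp != 0 ? cmp : -Integer.compare(a[1], b[1]);
    };

    // Looks at the first integer only (mirrors FirstGroupComparator).
    static final Comparator<int[]> GROUP = (a, b) -> Integer.compare(a[0], b[0]);

    // Sort the keys, then open a new group whenever GROUP reports that
    // the current key differs from the previous one.
    static List<List<int[]>> sortAndGroup(List<int[]> keys) {
        List<int[]> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<int[]>> groups = new ArrayList<>();
        int[] prev = null;
        for (int[] k : sorted) {
            if (prev == null || GROUP.compare(prev, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            prev = k;
        }
        return groups;
    }

    // Render the groups as text, one bracketed block per group.
    static String describe(List<List<int[]>> groups) {
        StringBuilder sb = new StringBuilder();
        for (List<int[]> g : groups) {
            sb.append('[');
            for (int i = 0; i < g.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(g.get(i)[0]).append(',').append(g.get(i)[1]);
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> keys = Arrays.asList(
                new int[]{2, 9}, new int[]{3, 8}, new int[]{2, 6}, new int[]{3, 2});
        // Two groups, so a reducer would be invoked twice.
        System.out.println(describe(sortAndGroup(keys))); // [2,9 2,6][3,8 3,2]
    }
}
```

With the four sample records this yields two groups, one for each distinct first integer, which corresponds to two reduce invocations.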

In the Reducer, I just output the IntPair as the key and NullWritable as the value:

static class SSReducer extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {
        protected void reduce(IntPair key, Iterable<NullWritable> values,
            Context context)throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
}

After running Hadoop, I got the following result:

   2,9
   3,8

Initially I expected the reducer to group records by key (IntPair). Since every record has a distinct key, 'reduce' would be called once per record, and the result would be:

2,9
2,6
3,8
3,2

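So I think the difference is caused by the group comparator, since it compares only the first integer. In the reducer, the records are therefore grouped by the first integer. In this example, that means every 2 records trigger one call to 'reduce', so without looping it only outputs the first record of each group. Is that right? I also ran another experiment, changing the reducer as follows: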

static class SSReducer extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {
    protected void reduce(IntPair key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        for (NullWritable n : values) { // add looping
            context.write(key, NullWritable.get());
        }
    }
}

This time the result contains 4 items.

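If I change the group comparator to compare both integers, it also produces 4 items. So the reducer actually uses the group comparator to group keys, which means all the records in one group trigger a single call to 'reduce', even if their keys differ.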

Answer 1

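Yes, all the records in one group trigger a single call to 'reduce', even if their keys differ. The reduce method is called once per group, with the first key in the group as the "KEY" and all the values in the group forming the values passed to the reduce method.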

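Although the reduce method is handed only one key (the first key) together with an iterable of all the values, you can see that while iterating, the key is updated to the key that corresponds to the current value.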

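First the group comparator is called with two keys and the reduce method is started; then, from inside the iterator, the group comparator is called again with the next two keys.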

This means the reducer does not know the contents of its iterable in advance. They are determined while the iterable is being iterated.

So if we do not iterate over the values, we only see the first key of the group. If we do iterate over the values, we get all the keys.
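The behavior described above can be imitated in plain Java without Hadoop. The following is only a rough sketch under stated assumptions: the class `GroupedReduceSketch`, its `Key` and `Reducer` types, and the `run` method are all hypothetical, `int[]` pairs stand in for serialized IntPair records, and the input list is assumed to be already sorted. It models a single reused key object that is overwritten as the values iterator advances, with group boundaries discovered lazily by comparing adjacent keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical simulation of reduce-side grouping; not Hadoop's actual code.
class GroupedReduceSketch {
    static final class Key { int first, second; }          // one reused key object
    interface Reducer { void reduce(Key key, Iterable<Object> values, List<String> out); }

    static final Comparator<int[]> BY_FIRST = (a, b) -> Integer.compare(a[0], b[0]);
    static final Comparator<int[]> BY_BOTH = (a, b) ->
            a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);

    // Reducer that ignores the values and writes the key once.
    static final Reducer NO_LOOP = (key, values, out) -> out.add(key.first + "," + key.second);
    // Reducer that writes the key once per value.
    static final Reducer LOOP = (key, values, out) -> {
        for (Object ignored : values) out.add(key.first + "," + key.second);
    };

    static void load(Key key, int[] record) { key.first = record[0]; key.second = record[1]; }

    static List<String> run(List<int[]> sorted, Comparator<int[]> group, Reducer reducer) {
        List<String> out = new ArrayList<>();
        int[] cursor = {0};                                // record currently loaded into key
        Key key = new Key();
        while (cursor[0] < sorted.size()) {
            load(key, sorted.get(cursor[0]));              // first key of the group
            boolean[] firstPending = {true};
            Iterable<Object> values = () -> new Iterator<Object>() {
                @Override public boolean hasNext() {
                    if (firstPending[0]) return true;
                    int next = cursor[0] + 1;              // group boundary is decided here
                    return next < sorted.size()
                            && group.compare(sorted.get(cursor[0]), sorted.get(next)) == 0;
                }
                @Override public Object next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    if (firstPending[0]) {
                        firstPending[0] = false;
                    } else {
                        cursor[0]++;
                        load(key, sorted.get(cursor[0]));  // key changes under the reducer
                    }
                    return null;                           // stands in for NullWritable
                }
            };
            reducer.reduce(key, values, out);
            // Skip whatever the reducer left unread in this group.
            while (cursor[0] + 1 < sorted.size()
                    && group.compare(sorted.get(cursor[0]), sorted.get(cursor[0] + 1)) == 0) {
                cursor[0]++;
            }
            cursor[0]++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[]{2, 9}, new int[]{2, 6}, new int[]{3, 8}, new int[]{3, 2});
        System.out.println(String.join(" ", run(sorted, BY_FIRST, NO_LOOP))); // 2,9 3,8
        System.out.println(String.join(" ", run(sorted, BY_FIRST, LOOP)));    // 2,9 2,6 3,8 3,2
        System.out.println(String.join(" ", run(sorted, BY_BOTH, NO_LOOP)));  // 2,9 2,6 3,8 3,2
    }
}
```

The three calls in `main` correspond to the three experiments in the question: grouping by the first integer without looping, the same grouping with looping, and grouping by both integers.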

Answer 2

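Your understanding is correct. The "composite value" of the key has no bearing on how records are grouped going into the reducer. It is the specific behavior of the comparators, and the particular fields they look at, that determines the grouping.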

Do all the records in one group invoke 'reduce' only once?
