Question

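I am creating a Hadoop application in Java using the MapReduce framework.

I use only Text keys and values for both input and output. I use a combiner to do an extra computation step before reducing to the final output.

The problem I have is that equal keys do not end up in the same reduce call. I create and write the key/value pairs in the combiner like this: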

public static class Step4Combiner extends Reducer<Text,Text,Text,Text> {
    private static Text key0 = new Text();
    private static Text key1 = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        key0.set("KeyOne");
        key1.set("KeyTwo");
        context.write(key0, new Text("some value"));
        context.write(key1, new Text("some other value"));
    }
}

public static class Step4Reducer extends Reducer<Text,Text,Text,Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        System.out.print("Key:" + key.toString() + " Value: ");
        String theOutput = "";
        for (Text val : values) {
            System.out.print("," + val);
        }
        System.out.print("\n");

        context.write(key, new Text(theOutput));
    }
}

In main I create the job like this:

Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Job job4 = new Job(conf, "Step 4");
job4.setJarByClass(Step4.class);

job4.setMapperClass(Step4.Step4Mapper.class);
job4.setCombinerClass(Step4.Step4Combiner.class);
job4.setReducerClass(Step4.Step4Reducer.class);

job4.setInputFormatClass(TextInputFormat.class);
job4.setOutputKeyClass(Text.class);
job4.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job4, new Path(outputPath));
FileOutputFormat.setOutputPath(job4, new Path(finalOutputPath));            

System.exit(job4.waitForCompletion(true) ? 0 : 1);

The output printed to stdout by the reducer is:

Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value

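This makes no sense to me: the keys are the same, so there should be 2 reduce calls, each with 3 identical values in its Iterable.

I hope you can help me get to the bottom of this :)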

Answer 1

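This is most likely because your combiner is running in both the map and the reduce phase (a little-known 'feature').

Basically, you are amending the key in the combiner, which may or may not run as the map outputs are merged together in the reducer. After the combiner has run (reduce side), the keys are fed through the grouping comparator to determine which values back the Iterable passed to the reduce method. (I'm skirting around the streaming aspect of the reduce phase here: the Iterable is not backed by a set or list of values. Rather, further calls to iterator().hasNext() return true as long as the grouping comparator considers the current key and the previous key to be the same.)

You can try to detect which side the combiner is currently running on (map or reduce) by inspecting the Context. There is a Context.getTaskAttempt().isMap() method, but I have some memory of this being a problem too, and there may even be a JIRA ticket about it somewhere.

Bottom line: don't amend the key in your combiner, unless you can find a way to bypass this behaviour when the combiner is running reduce side.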

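EDIT: To investigate @Amar's comment, I put together some code (pastebin link) which adds some verbose comparators, combiners, reducers and so on. If you run a job with a single map, no combiner runs in the reduce phase, and the map output is not sorted again because it is assumed to be sorted already.

It is assumed to be sorted because it was sorted before being sent to the combiner class, and the keys are assumed to come out of the combiner untouched, hence still sorted. Remember that a combiner is meant to combine the values for a given key.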
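To illustrate the last point, here is a minimal plain-Java sketch of what a key-preserving combine step looks like. It does not use the Hadoop API; the class name `KeyPreservingCombinerDemo`, the `combine` method and the `String[] {key, value}` pair representation are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a combiner: plain Java, no Hadoop classes.
class KeyPreservingCombinerDemo {

    // Merges the values of consecutive equal keys and emits each result under
    // the SAME key it came in with, so sorted input yields sorted output.
    static List<String[]> combine(List<String[]> sortedPairs) {
        List<String[]> out = new ArrayList<>();
        for (String[] pair : sortedPairs) {
            String key = pair[0];
            String value = pair[1];
            if (!out.isEmpty() && out.get(out.size() - 1)[0].equals(key)) {
                // Same key as the previous pair: fold the value into it.
                String[] last = out.get(out.size() - 1);
                last[1] = last[1] + "," + value;
            } else {
                // New key: start a new output pair, key left untouched.
                out.add(new String[] {key, value});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"a", "2"},
                new String[] {"b", "3"});
        for (String[] pair : combine(input)) {
            System.out.println(pair[0] + "=" + pair[1]);
        }
    }
}
```

Because the keys pass through unchanged, the framework's assumption that the combiner output is still sorted holds.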

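So with a single map and the given combiner, the reducer sees the keys in the order KeyOne, KeyTwo, KeyOne, KeyTwo, KeyOne, KeyTwo. The grouping comparator sees a transition between each of them, and hence you get 6 calls to the reduce function.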
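The effect can be reproduced without Hadoop. The sketch below is a simplified model, not Hadoop's actual implementation: it assumes that a new reduce call starts whenever a key differs from the key immediately before it, which is how the streaming grouping is described above. The class name `GroupingDemo` and the method `countReduceCalls` are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of reduce-side grouping over a stream of keys.
class GroupingDemo {

    // Counts the reduce() calls a key stream would trigger when a new group
    // starts every time the current key differs from the previous one.
    static int countReduceCalls(List<String> keys) {
        int calls = 0;
        String previous = null;
        for (String key : keys) {
            if (previous == null || !key.equals(previous)) {
                calls++; // key changed (or first key): a new group begins
            }
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Key order written by three combiner calls, never re-sorted.
        List<String> unsorted = Arrays.asList(
                "KeyOne", "KeyTwo", "KeyOne", "KeyTwo", "KeyOne", "KeyTwo");
        System.out.println(countReduceCalls(unsorted)); // prints 6

        // The same keys, had they actually been sorted.
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);
        System.out.println(countReduceCalls(sorted)); // prints 2
    }
}
```

Grouping only ever compares neighbouring keys, so it relies entirely on the stream being sorted; it never looks back to find an earlier group with the same key.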

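If you use two mappers, the reducer knows it has two supposedly sorted segments (one from each map), so it still needs to sort them before reducing. But as the number of segments is below a threshold, the sort is done as an inline stream sort (again, the segments are assumed to be sorted). The output is still wrong with two mappers (10 records are output from the reduce phase).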
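A rough way to see why merging does not repair the order is to merge two segments with a routine that trusts each one to be sorted. This is a plain-Java sketch under that assumption, not Hadoop's merger; `SegmentMergeDemo`, `merge` and `isSorted` are hypothetical names, and the segment contents are invented, so it does not reproduce the exact record count mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of merging two map-output segments on the reduce side.
class SegmentMergeDemo {

    // Two-way merge that assumes both segments are already sorted: it only
    // ever compares the current head of each segment.
    static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).compareTo(b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++)); // drain what is left of a
        while (j < b.size()) out.add(b.get(j++)); // drain what is left of b
        return out;
    }

    // True if every key is >= the key before it.
    static boolean isSorted(List<String> keys) {
        for (int k = 1; k < keys.size(); k++) {
            if (keys.get(k - 1).compareTo(keys.get(k)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Each segment is what a key-rewriting combiner might leave behind.
        List<String> segment = Arrays.asList("KeyOne", "KeyTwo", "KeyOne", "KeyTwo");
        List<String> merged = merge(segment, segment);
        System.out.println(merged);
        System.out.println(isSorted(merged)); // prints false
    }
}
```

A merge of this kind only produces sorted output when its inputs are sorted, so equal keys still end up separated and the grouping step splits them into several reduce calls.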

So again, don't amend the key in the combiner; this is not what the combiner is meant for.

Answer 2

Try this in the combiner instead:

context.write(new Text("KeyOne"), new Text("some value"));
context.write(new Text("KeyTwo"), new Text("some other value"));

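The only way I can see something like this happening is if the key0 from one combiner is not found to be equal to the key0 from another. I am not sure how it behaves when the keys point to exactly the same instance (which is what happens if you make the keys static).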

Two equal combiner keys do not reach the same reducer
