Question

So, I want to perform a reduce-side join with MR. (No Hive or anything; I'm experimenting with vanilla Hadoop atm.)

I have 2 input files. The first looks like this:
12 13
12 15
12 16
12 23

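The second is just 12 1000.

So I assign each file to a separate mapper, which tags each key-value pair with 0 or 1 depending on its source file. That part works. How do I know? I get the map output I expect: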

|key| |value|
12 0 1000
12 1 13
12 1 15
12 1 16 etc.
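
For illustration, the tagging scheme can be simulated in plain Java without Hadoop. This is only a minimal sketch: the class `TaggedJoinSketch` and its `Tagged` type are hypothetical names, and the in-memory sort stands in for the shuffle. It shows why the tag-0 record has to sort first within each key group for the join to work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedJoinSketch {
    // Hypothetical record type: join key, source tag ("0" = degree file,
    // "1" = edge file), and the value carried with it.
    public static final class Tagged {
        final String key;
        final String tag;
        final String value;

        public Tagged(String key, String tag, String value) {
            this.key = key;
            this.tag = tag;
            this.value = value;
        }
    }

    // Sorts by (key, tag) so the tag-0 record leads each key group,
    // then pairs that degree with every tag-1 record of the same key.
    public static List<String> join(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        sorted.sort((a, b) -> {
            int byKey = a.key.compareTo(b.key);
            return byKey != 0 ? byKey : a.tag.compareTo(b.tag);
        });

        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String degree = null;
        for (Tagged t : sorted) {
            if (!t.key.equals(currentKey)) {
                // New key group: forget the previous group's degree.
                currentKey = t.key;
                degree = null;
            }
            if (t.tag.equals("0")) {
                degree = t.value;
            } else if (degree != null) {
                joined.add(t.key + "\t" + degree + "\t" + t.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Tagged> mapOutput = Arrays.asList(
                new Tagged("12", "1", "13"),
                new Tagged("12", "1", "15"),
                new Tagged("12", "0", "1000"),
                new Tagged("12", "1", "16"));
        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

In the real job the sort on the composite key and the grouping on its first part are done by the framework; the loop here only mimics what the reducer is expected to see.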

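My partitioner partitions on the first part of the key (i.e. 12). The reducer should then join by key. Yet the job seems to skip the reduce step.

I wonder if there is something wrong with my driver?

My code (Hadoop v0.22, but I get the same results with 0.20.2 plus extra libs from the trunk):

Mapper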

public static class JoinDegreeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    public void map(Text node, Text degree, Context context)
            throws IOException, InterruptedException {

        context.write(new TextPair(node.toString(), "0"), degree);

    }
}

public static class JoinEdgeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    public void map(Text firstNode, Text secondNode, Context context)
            throws IOException, InterruptedException {

        context.write(new TextPair(firstNode.toString(), "1"), secondNode);

    }
}

Reducer

public static class JoinOnFirstReducer extends
        Reducer<TextPair, Text, Text, Text> {
    public void reduce(TextPair key, Iterator<Text> values, Context context)
            throws IOException, InterruptedException {

        context.progress();
        Text nodeDegree = new Text(values.next());
        while (values.hasNext()) {
            Text secondNode = values.next();
            Text outValue = new Text(nodeDegree.toString() + "\t"
                    + secondNode.toString());
            context.write(key.getFirst(), outValue);
        }
    }
}

Partitioner

public static class JoinOnFirstPartitioner extends
        Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text Value, int numOfPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numOfPartitions;
    }
}
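
As a side note, the arithmetic in `getPartition` can be checked in isolation. The sketch below is an assumption-laden stand-in: `PartitionSketch` is a hypothetical class, and it hashes a plain `String` rather than Hadoop's `Text`, so the actual partition numbers in the job may differ. It only illustrates why the hash is masked with `Integer.MAX_VALUE` before the modulo.

```java
public class PartitionSketch {
    // Same expression as in getPartition, applied to a raw hash value.
    // The mask clears the sign bit, so the result is never negative.
    public static int partitionOf(int hash, int numOfPartitions) {
        return (hash & Integer.MAX_VALUE) % numOfPartitions;
    }

    public static void main(String[] args) {
        // String hash used as a stand-in for the hash of the key's first part.
        System.out.println(partitionOf("12".hashCode(), 4));

        // Without the mask, Java's % keeps the sign of the dividend,
        // which would yield an invalid (negative) partition index.
        System.out.println(-7 % 4);
        System.out.println(partitionOf(-7, 4));
    }
}
```

Because only the first part of the key feeds the hash, records tagged 0 and 1 for the same node land in the same partition.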

Driver

public int run(String[] args) throws Exception {


    Path edgeListPath = new Path(args[0]);
    Path nodeListPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    Configuration conf = getConf();

    Job job = new Job(conf);
    job.setJarByClass(JoinOnFirstNode.class);
    job.setJobName("Tag first node with degree");

    job.setPartitionerClass(JoinOnFirstPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    //job.setSortComparatorClass(TextPair.FirstComparator.class);
    job.setReducerClass(JoinOnFirstReducer.class);

    job.setMapOutputKeyClass(TextPair.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);


    MultipleInputs.addInputPath(job, edgeListPath, EdgeInputFormat.class,
            JoinEdgeListMapper.class);
    MultipleInputs.addInputPath(job, nodeListPath, EdgeInputFormat.class,
            JoinDegreeListMapper.class);

    FileOutputFormat.setOutputPath(job, outputPath);


    return job.waitForCompletion(true) ? 0 : 1;

}

Answer 1

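My reduce function took an Iterator<> instead of an Iterable, so it did not override Reducer.reduce and the job fell through to the identity reducer. I can't believe I overlooked that. Noob mistake.

The answer comes from this Q/A: Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase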
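
The mechanism can be reproduced without Hadoop. In the sketch below, `BaseReducer` is a hypothetical stand-in for Hadoop's `Reducer` (whose default `reduce` writes every value through unchanged); the other class names are also made up for illustration. A method that takes `Iterator` is merely an overload that the framework never calls, while one that takes `Iterable` is a true override.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverloadSketch {
    // Hypothetical stand-in for Hadoop's Reducer: run() always calls
    // reduce(key, Iterable, out), whose default is an identity pass-through.
    public static class BaseReducer {
        protected void reduce(String key, Iterable<String> values, List<String> out) {
            for (String v : values) {
                out.add(key + "\t" + v);
            }
        }

        public final List<String> run(String key, List<String> values) {
            List<String> out = new ArrayList<>();
            reduce(key, values, out);
            return out;
        }
    }

    // Mirrors the bug: the Iterator parameter makes this an overload,
    // so run() still reaches the identity reduce in BaseReducer.
    public static class IteratorReducer extends BaseReducer {
        public void reduce(String key, Iterator<String> values, List<String> out) {
            out.add("joined " + key);
        }
    }

    // Mirrors the fix: Iterable matches the base signature. Adding
    // @Override to the buggy version would have been a compile error.
    public static class IterableReducer extends BaseReducer {
        @Override
        protected void reduce(String key, Iterable<String> values, List<String> out) {
            Iterator<String> it = values.iterator();
            String degree = it.next();
            while (it.hasNext()) {
                out.add(key + "\t" + degree + "\t" + it.next());
            }
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("1000", "13", "15");
        System.out.println(new IteratorReducer().run("12", values));
        System.out.println(new IterableReducer().run("12", values));
    }
}
```

Annotating `reduce` with `@Override` is a cheap guard here: the compiler then rejects any signature that does not actually override the framework method.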

Hadoop - Join with MultipleInputs probably skips Reducer
