So, I want to perform a reduce-side join with MR. (No Hive or anything; I'm experimenting with vanilla Hadoop at the moment.)
I have 2 input files. The first looks like this:
12 13
12 15
12 16
12 23
The second one is just 12 1000.
So I assign each file to a separate mapper, which tags each key-value pair as 0 or 1 depending on its source file. That works. How do I know? I get the map output as expected:
| Key | Value |
12 0 1000
12 1 13
12 1 15
12 1 16 (etc.)
My partitioner partitions based on the first part of the key (i.e. 12). The reducer should then join on key. However, the job seems to skip the reduce step.
I'm wondering whether there is something wrong with my driver?
My code (Hadoop v0.22, but the same result with 0.20.2 plus the extra libraries from trunk):
Mapper
public static class JoinDegreeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    // Tags each record from the node/degree file with "0".
    public void map(Text node, Text degree, Context context)
            throws IOException, InterruptedException {
        context.write(new TextPair(node.toString(), "0"), degree);
    }
}

public static class JoinEdgeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    // Tags each record from the edge list file with "1".
    public void map(Text firstNode, Text secondNode, Context context)
            throws IOException, InterruptedException {
        context.write(new TextPair(firstNode.toString(), "1"), secondNode);
    }
}
Reducer
public static class JoinOnFirstReducer extends
        Reducer<TextPair, Text, Text, Text> {
    public void reduce(TextPair key, Iterator<Text> values, Context context)
            throws IOException, InterruptedException {
        context.progress();
        Text nodeDegree = new Text(values.next());
        while (values.hasNext()) {
            Text secondNode = values.next();
            Text outValue = new Text(nodeDegree.toString() + "\t"
                    + secondNode.toString());
            context.write(key.getFirst(), outValue);
        }
    }
}
Partitioner
public static class JoinOnFirstPartitioner extends
        Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text Value, int numOfPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numOfPartitions;
    }
}
Driver
public int run(String[] args) throws Exception {
    Path edgeListPath = new Path(args[0]);
    Path nodeListPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);
    Configuration conf = getConf();

    Job job = new Job(conf);
    job.setJarByClass(JoinOnFirstNode.class);
    job.setJobName("Tag first node with degree");

    job.setPartitionerClass(JoinOnFirstPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    //job.setSortComparatorClass(TextPair.FirstComparator.class);
    job.setReducerClass(JoinOnFirstReducer.class);

    job.setMapOutputKeyClass(TextPair.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    MultipleInputs.addInputPath(job, edgeListPath, EdgeInputFormat.class,
            JoinEdgeListMapper.class);
    MultipleInputs.addInputPath(job, nodeListPath, EdgeInputFormat.class,
            JoinDegreeListMapper.class);

    FileOutputFormat.setOutputPath(job, outputPath);

    return job.waitForCompletion(true) ? 0 : 1;
}
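(The TextPair class itself is not included in the question. For context, a minimal, hypothetical sketch of what such a composite key could look like, inferred only from the getFirst() and FirstComparator usage above, not the asker's actual implementation, along the lines of the TextPair from Hadoop: The Definitive Guide:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TextPair implements WritableComparable<TextPair> {
    private Text first = new Text();
    private Text second = new Text();

    public TextPair() {
    }

    public TextPair(String first, String second) {
        this.first.set(first);
        this.second.set(second);
    }

    public Text getFirst() {
        return first;
    }

    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    // Natural order: first field, then the tag, so the "0"-tagged degree
    // record sorts ahead of the "1"-tagged edge records for the same node.
    public int compareTo(TextPair other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TextPair)) {
            return false;
        }
        TextPair tp = (TextPair) o;
        return first.equals(tp.first) && second.equals(tp.second);
    }

    // Groups map outputs by the first field only, so every tagged record
    // for a given node reaches the same reduce() call.
    public static class FirstComparator extends WritableComparator {
        public FirstComparator() {
            super(TextPair.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
        }
    }
}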
Answer 0 (score: 0)
My reduce function had Iterator<> instead of Iterable<>, so the job fell through to the identity reducer. I can't believe I overlooked that. Noob mistake.
The answer came from this Q/A: Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase
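For reference, a minimal sketch of the fixed reducer signature; the only real change from the code above is Iterable instead of Iterator (the framework only dispatches to reduce(KEYIN, Iterable<VALUEIN>, Context), so any other signature is never called and the default identity reducer runs instead):

public static class JoinOnFirstReducer extends
        Reducer<TextPair, Text, Text, Text> {
    @Override // with Iterable this now actually overrides Reducer.reduce()
    public void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        // First value is expected to be the degree, since the "0"-tagged
        // record sorts ahead of the "1"-tagged edges for the same node.
        Text nodeDegree = new Text(iter.next());
        while (iter.hasNext()) {
            Text secondNode = iter.next();
            context.write(key.getFirst(), new Text(nodeDegree.toString()
                    + "\t" + secondNode.toString()));
        }
    }
}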