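I'm writing a simple MapReduce program that counts how many times each row occurs in the input. The goal is to check whether two directories contain the same data, so in the reduce phase I check that each key occurs exactly twice (once from each input directory).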
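To make the intent concrete, here is a minimal in-memory sketch of the check I am after. It uses plain Java collections rather than Hadoop, and the class name `RowCountSketch`, the method `mismatchedRows`, and the sample rows are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowCountSketch {
    /** Returns, in sorted order, every row whose total count across both inputs is not exactly 2. */
    public static List<String> mismatchedRows(List<String> dirA, List<String> dirB) {
        // TreeMap keeps keys sorted so the result order is deterministic.
        Map<String, Long> counts = new TreeMap<>();
        // "Map" side: every row contributes a count of 1, summed per key.
        for (String row : dirA) {
            counts.merge(row, 1L, Long::sum);
        }
        for (String row : dirB) {
            counts.merge(row, 1L, Long::sum);
        }
        // "Reduce" side: keep only the keys whose sum differs from 2.
        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() != 2L) {
                mismatched.add(entry.getKey());
            }
        }
        return mismatched;
    }

    public static void main(String[] args) {
        List<String> dirA = Arrays.asList("r1", "r2", "r3");
        List<String> dirB = Arrays.asList("r1", "r2", "r4");
        System.out.println(mismatchedRows(dirA, dirB)); // prints [r3, r4]
    }
}
```

In the real job the rows are `BytesWritable` keys and the output is the set of keys whose sum is not 2.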
Here is my code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ResultsValidator extends Configured implements Tool {

        // Mapper: emits (row, 1) for every input row.
        public static class TuplesScanner extends Mapper<BytesWritable, NullWritable, BytesWritable, LongWritable> {
            private LongWritable one = new LongWritable(1);

            @Override
            public void map(BytesWritable row, NullWritable ignored, Context context) throws IOException, InterruptedException {
                context.write(row, one);
            }
        }

        // Combiner: sums the counts for each row on the map side.
        public static class TuplesCombiner extends Reducer<BytesWritable, LongWritable, BytesWritable, LongWritable> {
            @Override
            public void reduce(BytesWritable row, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (LongWritable value : values) {
                    sum += value.get();
                }
                context.write(row, new LongWritable(sum));
            }
        }

        // Reducer: writes out only the rows whose total count is not 2.
        public static class TuplesReducer extends Reducer<BytesWritable, LongWritable, BytesWritable, NullWritable> {
            @Override
            public void reduce(BytesWritable row, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (LongWritable value : values) {
                    sum += value.get();
                }
                if (sum != 2) {
                    context.write(row, NullWritable.get());
                }
            }
        }

        // Arguments: <inputDir0> <inputDir1> <outputDir> <numReducers>
        @Override
        public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
            Job job = Job.getInstance(getConf());
            Path inputDir0 = new Path(args[0]);
            Path inputDir1 = new Path(args[1]);
            Path outputDir = new Path(args[2]);
            int reducersNum = Integer.parseInt(args[3]);

            if (outputDir.getFileSystem(getConf()).exists(outputDir)) {
                throw new IOException("Output directory " + outputDir +
                        " already exists.");
            }

            FileInputFormat.addInputPath(job, inputDir0);
            FileInputFormat.addInputPath(job, inputDir1);
            FileOutputFormat.setOutputPath(job, outputDir);

            job.setJobName("ResultsValidator");
            job.setJarByClass(ResultsValidator.class);
            job.setMapperClass(TuplesScanner.class);
            job.setCombinerClass(TuplesCombiner.class);
            job.setReducerClass(TuplesReducer.class);
            job.setNumReduceTasks(reducersNum);

            job.setMapOutputKeyClass(BytesWritable.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setOutputKeyClass(BytesWritable.class);
            job.setOutputValueClass(NullWritable.class);

            // ResultsValidatorInputFormat and ResultsValidatorOutputFormat are my own classes (not shown here).
            job.setInputFormatClass(ResultsValidatorInputFormat.class);
            job.setOutputFormatClass(ResultsValidatorOutputFormat.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new ResultsValidator(), args);
            System.exit(res);
        }
    }

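I can't work out why the iterable in the reduce phase contains the wrong numbers. In the logs I see that the number each reducer receives equals the merged shuffle count.
Where did I go wrong?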
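For reference, this is the behaviour I expect, sketched in plain Java without Hadoop. My assumption, which may be part of what I have wrong, is that with a combiner the reducer's iterable can hold partial sums instead of individual 1s, but the final total per key should not change. `CombinerSketch` and its methods are hypothetical names for illustration, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {
    /** Sums a list of counts, which is what both the combiner and the reducer do. */
    public static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** No combiner: the reducer sees every raw value emitted by every map task. */
    public static long totalWithoutCombiner(List<List<Long>> perMapTask) {
        List<Long> all = new ArrayList<>();
        for (List<Long> values : perMapTask) {
            all.addAll(values);
        }
        return sum(all);
    }

    /** With a combiner: each map task's values are pre-summed, then the reducer sums the partial sums. */
    public static long totalWithCombiner(List<List<Long>> perMapTask) {
        List<Long> partialSums = new ArrayList<>();
        for (List<Long> values : perMapTask) {
            partialSums.add(sum(values));
        }
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // One key, emitted once by each of two hypothetical map tasks (one per input directory).
        List<List<Long>> perMapTask = Arrays.asList(Arrays.asList(1L), Arrays.asList(1L));
        System.out.println(totalWithoutCombiner(perMapTask)); // prints 2
        System.out.println(totalWithCombiner(perMapTask));    // prints 2
    }
}
```

In the job above I do not see totals like this, which is why I suspect something other than the summing logic.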