我有两个mapper类,它们只是创建键值对,我的主要逻辑应该在reducer部分。我正在尝试比较来自两个不同文本文件的数据。
我的mapper类是
public static class Map extends
Mapper<LongWritable, Text, Text, Text> {
private String ky,vl="a";
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String tokens[] = line.split("\t");
vl = tokens[1].trim();
ky = tokens[2].trim();
//sending key-value pairs to the reducer
context.write(new Text(ky),new Text(vl));
}
}
我的第二个映射器是
public static class Map2 extends
Mapper<LongWritable, Text, Text, Text> {
private String ky2,vl2 = "a";
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String tokens[] = line.split("\t");
vl2 = tokens[1].trim();
ky2 = tokens[2].trim();
//sending key-value pairs to the reducer
context.write(new Text(ky2),new Text(vl2));
}
}
Reducer类是
public static class Reduce extends
Reducer<Text, Text, Text, Text> {
private String rslt = "l";
public void reduce(Text key, Iterator<Text> values,Context context) throws IOException, InterruptedException {
int count = 0;
while(values.hasNext()){
count++;
}
rslt = Integer.toString(count);
if(count>1){
context.write(key,new Text(rslt));
}
}
}
我的主要方法是
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(CompareTwoFiles.class);
job.setJobName("Compare Two Files and Identify the Difference");
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, Map2.class);
job.waitForCompletion(true);
输出
File System Counters
FILE: Number of bytes read=361621
FILE: Number of bytes written=1501806
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=552085
HDFS: Number of bytes written=150962
HDFS: Number of read operations=28
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=10783
Map output records=10783
Map output bytes=150962
Map output materialized bytes=172540
Input split bytes=507
Combine input records=0
Combine output records=0
Reduce input groups=7985
Reduce shuffle bytes=172540
Reduce input records=10783
Reduce output records=10783
Spilled Records=21566
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=12
Total committed heap usage (bytes)=928514048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=150962