I am running into a problem with my MapReduce program: my input file is being read twice by the job. I have already gone through the answer to "why is my sequence file being read twice in my hadoop mapper class?", but unfortunately it did not help.
My Mapper class is:
package com.siddu.mapreduce.csv;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text line,
            OutputCollector<Text, IntWritable> output, Reporter report)
            throws IOException
    {
        // Emit the third semicolon-separated field of each line with a count of 1.
        // Assumes every line has at least three fields.
        String lineCSV = line.toString();
        String[] tokens = lineCSV.split(";");
        output.collect(new Text(tokens[2]), one);
    }
}
My Reducer class is:
package com.siddu.mapreduce.csv;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterator<IntWritable> inputFrmMapper,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        // This goes to the task attempt's stdout log, not the job output.
        System.out.println("In reducer the key is:" + key.toString());

        // Sum the counts emitted by the mapper for this key.
        int relationOccurrence = 0;
        while (inputFrmMapper.hasNext())
        {
            relationOccurrence += inputFrmMapper.next().get();
        }
        output.collect(key, new IntWritable(relationOccurrence));
    }
}
And finally my driver class is:
package com.siddu.mapreduce.csv;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SidduCSVMapReduceDriver
{
    public static void main(String[] args)
    {
        JobConf conf = new JobConf(SidduCSVMapReduceDriver.class);
        conf.setJobName("Siddu CSV Reader 1.0");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(SidduCSVMapper.class);
        conf.setReducerClass(SidduCSVReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        try
        {
            // Submit the job and block until it completes.
            JobClient.runJob(conf);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Answer 0 (score: 0)
You should be aware that Hadoop can spawn multiple attempts of the same task, typically via speculative execution, often two per mapper. If you are seeing your log output twice, that is probably the reason.
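One quick way to test this hypothesis is to disable speculative execution so each task runs exactly one attempt. A minimal sketch against the old org.apache.hadoop.mapred API used above, assuming the lines are added to the driver before JobClient.runJob(conf):

// Hypothetical addition to the driver: run exactly one attempt per task.
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);

If the duplicated log output disappears after this change, the duplicates were per-attempt logging from speculative task attempts, not the input actually being consumed twice by the job.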