Hadoop MapReduce mapper is reading my input file twice

Time: 2014-03-17 20:21:35

Tags: java hadoop mapreduce

I'm running into a problem with a MapReduce program I wrote: my input file is being read twice. I have already gone through the answer to why is my sequence file being read twice in my hadoop mapper class?, but unfortunately it didn't help.

My Mapper class is:

package com.siddu.mapreduce.csv;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text line,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        // Split the semicolon-delimited CSV line and emit the third field with a count of 1.
        String lineCSV = line.toString();
        String[] tokens = lineCSV.split(";");
        output.collect(new Text(tokens[2]), one);
    }
}

My Reducer class is:

package com.siddu.mapreduce.csv;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterator<IntWritable> inputFrmMapper,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        System.out.println("In reducer the key is:" + key.toString());

        // Sum the 1s emitted by the mapper for this key.
        int relationOccurrence = 0;
        while (inputFrmMapper.hasNext())
        {
            relationOccurrence += inputFrmMapper.next().get();
        }

        output.collect(key, new IntWritable(relationOccurrence));
    }
}

Finally, my driver class is:

package com.siddu.mapreduce.csv;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SidduCSVMapReduceDriver 
{
    public static void main(String[] args) 
    {

        JobClient client = new JobClient();
        JobConf conf = new JobConf(com.siddu.mapreduce.csv.SidduCSVMapReduceDriver.class);

        conf.setJobName("Siddu CSV Reader 1.0");

        conf.setOutputKeyClass(Text.class);

        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(com.siddu.mapreduce.csv.SidduCSVMapper.class);
        conf.setReducerClass(com.siddu.mapreduce.csv.SidduCSVReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        client.setConf(conf);

        try
        {
            JobClient.runJob(conf);
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

1 Answer:

Answer 0 (score: 0):

You should be aware that Hadoop spawns multiple attempts of a task, usually two per mapper, when speculative execution is enabled. If you are seeing your log file output twice, that is most likely the reason.
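
If that is the cause, a quick way to test it is to switch speculative execution off so the framework launches exactly one attempt per task, and then compare the job's map-input-record counter with the line count of your input file. Below is a minimal sketch against the old mapred API used in the driver above; it assumes an extra import of org.apache.hadoop.mapred.RunningJob, and the counter group/name strings follow the old Task counter naming, which can vary slightly between Hadoop versions:

// In the driver, before submitting: disable speculative execution so
// Hadoop launches only one attempt per map/reduce task.
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);

// After the job finishes, read the map-input-records counter and compare
// it with the number of lines in the input file. If they match, the data
// was read only once and the doubled logs came from extra task attempts.
RunningJob job = JobClient.runJob(conf);
long mapInputRecords = job.getCounters()
        .findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
        .getValue();
System.out.println("Map input records: " + mapInputRecords);

Note that extra attempts do not duplicate your job's output: the framework commits the results of only one successful attempt per task, so doubled log lines from killed attempts are harmless.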