WordCount MapReduce is giving unexpected results

Asked: 2013-04-22 10:57:21

Tags: java hadoop mapreduce word-count

I am trying out the word-count Java example in MapReduce. After the reduce method has finished, I want to output only the unique word that occurs the most times.

For this I created some class-level variables named myoutput, mykey and completeSum.

I write this data out in the close method, but in the end I get an unexpected result.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    static int completeSum = -1;
    static OutputCollector<Text, IntWritable> myoutput;
    static Text mykey = new Text();

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }

            if (completeSum < sum) {
                completeSum = sum;
                myoutput = output;
                mykey = key;
            }
        }

        @Override
        public void close() throws IOException {
            super.close();
            myoutput.collect(mykey, new IntWritable(completeSum));
        }
    }

    public static void main(String[] args) throws Exception {

        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Input file data:

one 
three three three
four four four four 
 six six six six six six six six six six six six six six six six six six 
five five five five five 
seven seven seven seven seven seven seven seven seven seven seven seven seven 

The result should be:

six 18

But instead I am getting this result:

three 18

From the result I can see that the sum is correct, but the key is not.

It would be very helpful if someone could point me to a good reference on how these map and reduce methods work.

1 Answer:

Answer (score: 1):

The problem you are observing is due to reference aliasing. The object referenced by key is reused with new contents on each call, which also changes mykey because it refers to the same object, so it ends up holding the key from the last reduce call. Copying the object avoids this, e.g.:

mykey = new Text(key);
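
For illustration, here is a minimal sketch of the Reduce class with that copy applied. It keeps the asker's static-variable approach purely to show where the copy goes; the next paragraph explains why that approach should be avoided anyway.

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        if (completeSum < sum) {
            completeSum = sum;
            myoutput = output;
            // Hadoop reuses the Text instance passed as key, so keep a copy
            mykey = new Text(key);
        }
    }

    @Override
    public void close() throws IOException {
        super.close();
        // emit only the single most frequent word seen by this reducer
        myoutput.collect(mykey, new IntWritable(completeSum));
    }
}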

However, you should only take the result from the output files, because static variables cannot be shared between different nodes in a distributed cluster. Relying on them only works in standalone mode and defeats the purpose of map-reduce.
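
As a rough sketch of that advice (not part of the original answer): keep the standard summing reducer, let the job write the full word counts with TextOutputFormat, and then scan the output files from the driver to pick the maximum. The helper name printMostFrequentWord and its call site are made up for illustration, and the code assumes output lines in the usual word<TAB>count form.

// Additional imports assumed: java.io.BufferedReader, java.io.InputStreamReader,
// org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem.

// Call after JobClient.runJob(conf), e.g.:
// printMostFrequentWord(conf, new Path(args[1]));
private static void printMostFrequentWord(JobConf conf, Path outDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    String bestWord = null;
    long bestCount = -1;

    for (FileStatus status : fs.listStatus(outDir)) {
        // Only read the reducer output files, skipping _SUCCESS and _logs.
        if (!status.getPath().getName().startsWith("part-")) {
            continue;
        }
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(status.getPath()), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // TextOutputFormat writes "word<TAB>count" per line.
                String[] fields = line.split("\t");
                long count = Long.parseLong(fields[1]);
                if (count > bestCount) {
                    bestCount = count;
                    bestWord = fields[0];
                }
            }
        } finally {
            reader.close();
        }
    }
    System.out.println(bestWord + "\t" + bestCount);
}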

Finally, even with global variables in standalone mode, running parallel local tasks will in most cases lead to races (see MAPREDUCE-1367 and MAPREDUCE-434).