Hadoop Word Count works but doesn't sum the words

Asked: 2014-01-04 18:35:18

Tags: hadoop mapreduce word-count

I'm using Hadoop 1.2.1 and, for some reason, my word count output looks odd:

Input file:

this is sparta this was sparta hello world goodbye world

Output in HDFS:

goodbye 1
hello   1
is  1
sparta  1
sparta  1
this    1
this    1
was 1
world   1
world   1
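
What I expected instead (each word once, with its total count):

goodbye 1
hello   1
is  1
sparta  2
this    2
was 1
world   2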

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
} 

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, Context context) 
    throws IOException, InterruptedException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        context.write(key, new IntWritable(sum));
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}

}

Here's some of the relevant console output:

14/01/04 16:17:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/04 16:17:37 INFO input.FileInputFormat: Total input paths to process : 1
14/01/04 16:17:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/04 16:17:37 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/04 16:17:38 INFO mapred.JobClient: Running job: job_201401041506_0013
14/01/04 16:17:39 INFO mapred.JobClient:  map 0% reduce 0%
14/01/04 16:17:45 INFO mapred.JobClient:  map 100% reduce 0%
14/01/04 16:17:52 INFO mapred.JobClient:  map 100% reduce 33%
14/01/04 16:17:54 INFO mapred.JobClient:  map 100% reduce 100%
14/01/04 16:17:55 INFO mapred.JobClient: Job complete: job_201401041506_0013
14/01/04 16:17:55 INFO mapred.JobClient: Counters: 26
14/01/04 16:17:55 INFO mapred.JobClient:   Job Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=6007
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Launched map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     Data-local map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9167
14/01/04 16:17:55 INFO mapred.JobClient:   File Output Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Written=77
14/01/04 16:17:55 INFO mapred.JobClient:   FileSystemCounters
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_READ=123
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_READ=169
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=122037
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=77
14/01/04 16:17:55 INFO mapred.JobClient:   File Input Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Read=57
14/01/04 16:17:55 INFO mapred.JobClient:   Map-Reduce Framework
14/01/04 16:17:55 INFO mapred.JobClient:     Map output materialized bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Map input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce shuffle bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Spilled Records=20
14/01/04 16:17:55 INFO mapred.JobClient:     Map output bytes=97
14/01/04 16:17:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
14/01/04 16:17:55 INFO mapred.JobClient:     Combine input records=0
14/01/04 16:17:55 INFO mapred.JobClient:     SPLIT_RAW_BYTES=112
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input groups=7
14/01/04 16:17:55 INFO mapred.JobClient:     Combine output records=0
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce output records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Map output records=10

What could be causing this? I'm new to Hadoop, so I don't know where to look. Thanks!

1 Answer:

Answer (score: 2)

You're using the old API signature. In 1.x, reduce takes an Iterable rather than an Iterator (the Iterator form comes from the old 0.x API, which is why you'll still see it in many books and examples on the web). Because your method's second parameter is an Iterator, it doesn't override Reducer.reduce at all; Hadoop silently falls back to the base class's default implementation, which passes every (key, value) pair straight through. Your counters confirm this: Reduce input records=10 and Reduce output records=10, even though there are only 7 input groups.

http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapreduce/Reducer.html#reduce%28KEYIN,%20java.lang.Iterable,%20org.apache.hadoop.mapreduce.Reducer.Context%29
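
For reference, the base Reducer's default reduce is essentially an identity pass. A sketch paraphrasing the 1.2.x source (see the Javadoc above):

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException {
    // Default behavior: write every value through unchanged under its key.
    // This is exactly the duplicated, unsummed output you're seeing.
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}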

Try:

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Iterable<IntWritable> matches the 1.x Reducer signature,
    // so this method now actually overrides Reducer.reduce.
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}

The @Override annotation tells the compiler to verify that your reduce method actually overrides a method with a matching signature in the parent class.
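
With the annotation in place, the original Iterator-based method would have been rejected at compile time instead of silently producing wrong output. A minimal sketch of what the compiler would report:

// Adding @Override to the original signature turns the bug into a
// compile error, since Iterator<IntWritable> matches nothing in Reducer:
@Override
public void reduce(Text key, Iterator<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // ...
}
// error: method does not override or implement a method from a supertype

As a side note, once the signature is fixed you can optionally register the same class as a combiner, so counts are pre-summed on the map side (your Combine input records=0 counter shows none ran). This works here because summing is associative and commutative:

job.setCombinerClass(Reduce.class);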