Finding word-length frequencies with MapReduce

Asked: 2014-10-24 22:03:36

Tags: java hadoop mapreduce

I am new to MapReduce, and I would like to ask if someone can give me an idea of how to count word-length frequencies with MapReduce. I already have the code for word count, but I want to count word lengths instead. This is what I have so far:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}

Thanks...

1 Answer:

Answer 0 (score: 2):

For word-length frequency, tokenizer.nextToken() should not be emitted as the key. What actually matters is the length of that string. So your code will work just fine with the following change, and it is sufficient:

word.set( String.valueOf( tokenizer.nextToken().length() ));  

Now, if you look a little deeper, you will realize that the mapper output key should not be Text, although it works. An IntWritable key is a better fit:

public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit (length, 1) instead of (word, 1).
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
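
To complete the job, the reducer is just the standard summing reducer from word count, keyed by length. A minimal sketch (this class is my addition, not part of the original answer):

public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private final IntWritable total = new IntWritable();

    public void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mapper for this word length.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // Output pairs like (5, 42), meaning 42 words of length 5 were seen.
        context.write(length, total);
    }
}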

Although most MapReduce examples use StringTokenizer, the String.split method is cleaner and more advisable. So make the change accordingly.
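
For example, the tokenizing loop could be rewritten along these lines (a sketch; the \s+ whitespace regex is my assumption about the desired token boundaries):

String[] tokens = value.toString().split("\\s+");  // split on runs of whitespace
for (String token : tokens) {
    // split can produce an empty first element if the line starts with whitespace
    if (token.isEmpty()) {
        continue;
    }
    wordLength.set(token.length());
    context.write(wordLength, one);
}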