I'm new to MapReduce, and I'd like to ask whether anyone can give me an idea of how to compute word-length frequencies with MapReduce. I already have the word-count code, but I want to count word lengths instead. This is what I have so far:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}
Thanks...
Answer 0 (score: 2)
For word-length frequency, tokenizer.nextToken() should not be emitted as the key; what matters is the length of that string. So with just the following change your code will run fine, and that is sufficient:
word.set(String.valueOf(tokenizer.nextToken().length()));
Now, if you dig a little deeper, you'll see that the Mapper output key should not really be Text, even though it works. It is better to use an IntWritable key:
public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
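To see what the full job computes end to end, here is a plain-Java sketch (no Hadoop required) of the same logic: the mapper emits (length, 1) for each token, and the shuffle plus reduce phase sums the 1s per length. The class and method names below are illustrative, not part of your code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Simulates the word-length-frequency job: tokenize, map each token
// to its length, then sum the 1s per length (the reducer's role).
public class WordLengthFreq {
    public static Map<Integer, Integer> lengthFrequency(String text) {
        Map<Integer, Integer> freq = new HashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            // merge() plays the reducer's part: sum the 1s per key
            freq.merge(tokenizer.nextToken().length(), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // "to be or not to be" -> five 2-letter words, one 3-letter word
        System.out.println(lengthFrequency("to be or not to be"));
    }
}
```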
Although most MapReduce examples use StringTokenizer, the String.split method is cleaner and more sensible, so change your code accordingly.
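As a small sketch of that suggestion: the regex "\\s+" makes split collapse runs of whitespace the same way StringTokenizer's default delimiters do, as long as you trim the line first to avoid a leading empty token. The helper class below is illustrative.

```java
import java.util.Arrays;

// Tokenizing with String.split instead of StringTokenizer.
// "\\s+" matches runs of spaces/tabs, matching StringTokenizer's
// default behavior; trim() avoids a leading empty token.
public class SplitTokens {
    public static String[] tokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) return new String[0];
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("  hello   world\tfoo ")));
    }
}
```

Inside the mapper, the while loop over the tokenizer then becomes a simple for-each over the returned array.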