Improving the identity mapper in WordCount

Posted: 2016-08-21 13:30:57

Tags: hadoop mapreduce yarn

I wrote a map method that reads the map output of the WordCount example [1]. The example no longer uses the IdentityMapper that MapReduce used to provide, but this was the only way I found to build a working WordCountIdentityMapper for WordCount. The only problem is that this mapper takes far longer than I would like, and I am starting to think I may be doing something redundant. Can anyone help me improve this code?

public class WordCountIdentityMapper extends MyMapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}

[1] Identity mapper

public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}
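Outside of Hadoop, the map() body above boils down to a whitespace-tokenize-and-emit loop. Here is a plain-Java sketch of that loop; the TokenizeDemo class, the emit helper, and the tab-separated "word&lt;TAB&gt;1" record format are assumptions for illustration, standing in for context.write():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Mirrors the map() body: split the line on whitespace and
    // emit one ("token", 1) record per token, here as "token\t1" strings.
    static List<String> emit(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken() + "\t1");
        }
        return out;
    }
}
```

Each input line thus produces one record per word, which is exactly the map output the identity mapper in the question then has to re-read and re-parse.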


Thanks,

1 Answer:

Answer 0 (score: 0)

The solution was to replace the StringTokenizer with indexOf(). It works much better, and my performance improved.
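A minimal sketch of what that replacement could look like, assuming the map output lines are tab-separated "word&lt;TAB&gt;count" pairs as Hadoop's default text output writes them (the ParseDemo class and parseLine helper are hypothetical names). Scanning once with indexOf() and slicing with substring() avoids StringTokenizer's per-line object allocation:

```java
public class ParseDemo {
    // Parse a "word\tcount" line with indexOf()/substring()
    // instead of StringTokenizer; returns {word, count}.
    static Object[] parseLine(String line) {
        int sep = line.indexOf('\t');          // position of the single delimiter
        String word = line.substring(0, sep);  // text before the tab
        int count = Integer.parseInt(line.substring(sep + 1)); // number after it
        return new Object[] { word, count };
    }
}
```

In the mapper, word.set(...) and a reused IntWritable would take these two values, so the per-record work drops to one scan and two slices of the line.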