Question

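I have been asked to modify the WordCount example so that each mapper sums the occurrences of the words in its own file before passing them on. For example, instead of emitting a separate <word,1> pair for every occurrence, the mapper's output would be:

<help,2>
<you,1>
<me,1>

The mapper currently looks like this: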

String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one);
}

So would I add each word to an array and then check for occurrences? Or is there an easier way?


Answer 1

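You can define a Java Map structure, or a Guava Multiset, and count the occurrences of each word within each Mapper. Then, when the mapper finishes, the cleanup method that runs afterwards can emit all the partial sums as the map output, like this (pseudocode):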

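Map<String,Integer> counts; // a field, so setup(), map() and cleanup() share it

setup() {
    counts = new HashMap<>();
}

map() {
    for each word w {
        counts.merge(w, 1, Integer::sum); // stores 1 if w is absent, otherwise adds 1
    }
}

cleanup() {
    for each entry (w, n) of counts.entrySet() {
        context.write(new Text(w), new IntWritable(n)); // wrap in Writable types
    }
}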
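To make the pseudocode above concrete, here is a minimal sketch of the same idea in plain Java. It is not a real Hadoop Mapper: the class InMapperCountSketch, the emit callback and the runOnSplit helper are hypothetical stand-ins for the framework, used only so the example runs without Hadoop on the classpath. It also assumes that all distinct words of one split fit in memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.function.BiConsumer;

// Plain-Java stand-in for a Hadoop Mapper; no Hadoop classes are used here.
class InMapperCountSketch {
    // Field, so setup(), map() and cleanup() all see the same map.
    private Map<String, Integer> counts;

    void setup() {
        counts = new HashMap<>();
    }

    // Called once per input line; only updates the in-memory partial sums.
    void map(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
    }

    // Called once at the end; emits one <word, partialSum> pair per distinct word.
    void cleanup(BiConsumer<String, Integer> emit) {
        counts.forEach(emit);
    }

    // Hypothetical driver that mimics the framework's call order for one split.
    static Map<String, Integer> runOnSplit(List<String> lines) {
        InMapperCountSketch mapper = new InMapperCountSketch();
        Map<String, Integer> emitted = new HashMap<>();
        mapper.setup();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup(emitted::put);
        return emitted;
    }

    public static void main(String[] args) {
        // Two lines containing "help" twice, "you" once and "me" once.
        System.out.println(runOnSplit(List.of("help you", "help me")));
    }
}
```

In an actual Mapper subclass the same three methods would take the Hadoop Context, and cleanup would call context.write with Text and IntWritable values instead of the callback.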

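Quoting the Mapper's documentation (version 2.6.2):

The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().

The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.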

Apart from that, you can also consider using a Combiner as an alternative.
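As a rough illustration of what a Combiner does, the sketch below leaves the map step unchanged (one <word,1> pair per token) and then sums the pairs produced by a single mapper before they would be sent on. It is plain Java, not Hadoop code: CombinerSketch and its map and combine methods are hypothetical names chosen for this example. In a real job the combiner is a Reducer class registered on the Job with setCombinerClass, and the framework decides whether and how often to run it, so it should be treated as an optimization rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java stand-in for the map and combine phases; no Hadoop classes are used.
class CombinerSketch {
    // Unmodified WordCount map step: emits one <word, 1> pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Combine step: sums the values per key over one mapper's local output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>(); // sorted only to make printing stable
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.addAll(map("help you"));
        mapOutput.addAll(map("help me"));
        System.out.println(combine(mapOutput)); // prints {help=2, me=1, you=1}
    }
}
```

Compared with counting inside the mapper, this keeps the mapper stateless, at the cost of still creating one pair per token before the combine step runs.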
