Question

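Using MapReduce, how can I modify the word count code below so that it only outputs words whose count reaches a given threshold? (In other words, I want to add some filtering of key-value pairs.)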

Input:

ant bee cat
bee cat dog
cat dog

Output (assuming a count threshold of 2 or more):

bee 2
cat 3
dog 2
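
The expected result can be checked without a cluster by simulating the map, reduce, and filter steps in plain Java. This is only a rough sketch: the class `ThresholdWordCount` and the method `countAtLeast` are hypothetical names for illustration and are not part of Hadoop. Note that `bee` occurs twice in the sample input, so with a threshold of 2 it is kept along with `cat` and `dog`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class ThresholdWordCount {

    // Counts whitespace-separated tokens, then keeps only the words
    // that were seen at least `threshold` times.
    public static Map<String, Integer> countAtLeast(List<String> lines, int threshold) {
        // TreeMap keeps keys sorted, much like keys arrive sorted at a reducer.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // "map" emits (word, 1); "reduce" sums the ones per word.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        // The filter step: drop every word whose total is below the threshold.
        counts.values().removeIf(sum -> sum < threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ant bee cat", "bee cat dog", "cat dog");
        countAtLeast(input, 2).forEach((word, sum) -> System.out.println(word + " " + sum));
    }
}
```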

The following code is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code

public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Edit: RE: the input/test case

The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749

Edit:

The problem was not in the code. It was caused by some odd behavior of the org.apache.hadoop.mapred package (see "Is it better to use the mapred or the mapreduce package to create a Hadoop Job?").

Takeaway: use `org.apache.hadoop.mapreduce` instead.
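
For reference, a filtering reducer on the newer `org.apache.hadoop.mapreduce` API might look roughly like the sketch below. It assumes the Hadoop libraries are on the classpath. The class name `ThresholdReducer` and the configuration key `wordcount.min.count` are made up for illustration; they are not taken from the question or from Hadoop itself.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minCount;
    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // "wordcount.min.count" is a hypothetical key; fall back to 2 if it is unset.
        minCount = context.getConfiguration().getInt("wordcount.min.count", 2);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Only emit words whose total count reaches the threshold.
        if (sum >= minCount) {
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Under these assumptions the driver could set the threshold with `job.getConfiguration().setInt("wordcount.min.count", 2)`. A reducer like this should not be reused as a combiner, because filtering partial sums on the map side would drop words whose total across all mappers still reaches the threshold.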

Answer 1

Try adding an if statement before collecting the output in reduce.

if(sum >= 2)
    output.collect(key, new IntWritable(sum));

Answer 2

You can do the filtering in the Reduce1 class:

if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}

MapReduce: filter out key-value pairs if the value is not above a threshold