Question

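I managed to run the Hadoop wordcount example in non-distributed mode; I get the output in a file named "part-00000", and I can see that it lists all the words from all the input files combined.

After tracing the wordcount code, I can see that it takes lines and splits them into words on whitespace.

I am trying to think of a way to list only the words that occur in more than one file, along with their occurrence counts. Can this be achieved with Map/Reduce? -Added- Are these changes appropriate?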

    // Changes in the type parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

        // These are the original lines; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        private String filename = fileSplit.getPath().getName();

        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());

                // And here
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }

Answer 1

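You can amend the mapper to output the word as the key, and a Text as the value representing the name of the file the word came from. Then in your reducer, you just need to de-duplicate the filenames and output only those entries where the word appears in more than one file.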
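The dataflow described above can be sketched in plain Java, without any Hadoop dependency, to make the reducer's job concrete. The class name `CommonWords` and the sample file names are hypothetical; in a real job the grouping by word is done by the framework's shuffle rather than by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

public class CommonWords {

    // "Map" step: emit one (word, filename) pair per whitespace-separated token.
    public static List<String[]> map(String filename, String contents) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(contents);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new String[] {tokenizer.nextToken(), filename});
        }
        return pairs;
    }

    // "Shuffle" + "reduce" step: group filenames by word, de-duplicate them
    // with a set, and keep only the words seen in more than one file.
    public static Map<String, Set<String>> reduce(List<String[]> pairs) {
        Map<String, Set<String>> filesByWord = new TreeMap<>();
        for (String[] pair : pairs) {
            filesByWord.computeIfAbsent(pair[0], w -> new TreeSet<>()).add(pair[1]);
        }
        filesByWord.values().removeIf(files -> files.size() < 2);
        return filesByWord;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("a.txt", "hello world hello"));
        pairs.addAll(map("b.txt", "hello hadoop"));
        System.out.println(reduce(pairs)); // prints {hello=[a.txt, b.txt]}
    }
}
```

Only "hello" survives here, because "world" and "hadoop" each appear in a single file. If per-file counts are also wanted, the value could carry a count alongside the filename, but that goes beyond what this sketch shows.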

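How you get the name of the file being processed depends on whether you are using the new API or not (mapred or mapreduce package names). I know that for the new API you can extract the mapper's input split from the Context object using the getInputSplit method (then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). For the old API, I've never tried it, but apparently you can use a configuration property called map.input.file.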
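For the new API, a minimal sketch of that lookup might look like the following. It needs the Hadoop client libraries on the classpath. The class name `WordFileMapper` is hypothetical, and the `LongWritable` key type assumes `TextInputFormat`; the cast to `FileSplit` only holds for file-based input formats.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text word = new Text();
    private final Text filename = new Text();

    @Override
    protected void setup(Context context) {
        // Assumption: the input format is file-based (e.g. TextInputFormat),
        // so the split handed to this task can be cast to FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        filename.set(split.getPath().getName());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, filename) for every whitespace-separated token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, filename);
        }
    }
}
```

Resolving the filename once in `setup` avoids repeating the cast for every input line, since a map task processes a single split.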

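This is also a good candidate for introducing a combiner, to remove multiple occurrences of the same word coming from the same mapper.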
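What such a combiner would do can be sketched in plain Java, with no Hadoop dependency. The class and method names are hypothetical; in a real job this logic would live in a Reducer subclass registered as the job's combiner.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class FilenameCombiner {

    // Mirrors a combiner call: one word plus every filename value the local
    // mapper emitted for it. Returns each distinct filename exactly once.
    public static List<String> combine(String word, List<String> filenames) {
        return new ArrayList<>(new LinkedHashSet<>(filenames));
    }

    public static void main(String[] args) {
        List<String> emitted = List.of("a.txt", "a.txt", "a.txt");
        // Three records for "hello" shrink to one before the shuffle.
        System.out.println(combine("hello", emitted)); // prints [a.txt]
    }
}
```

Because de-duplication gives the same result whether it runs zero, one, or several times, it is safe to use as a combiner, which the framework may or may not invoke.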

Update

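So, to address your problem: you are trying to use an instance variable named reporter, which does not exist at the mapper's class scope. The reporter is only passed in as an argument to map(), so the filename lookup has to move inside that method.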

(Really not sure why SO isn't respecting the formatting above...)
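A minimal sketch of how the question's old-API mapper could be amended along these lines, with the reporter used only inside map(). It needs the Hadoop client libraries on the classpath. The class name `WordFileMap` is hypothetical, and the key type is changed to `LongWritable` on the assumption that the job uses `TextInputFormat`; the question's `Text` key would instead fit a key/value input format.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordFileMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Text word = new Text();
    private final Text outvalue = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // reporter is a method parameter, so the split lookup happens here
        // rather than in a field initializer. Assumes a file-based input format.
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        outvalue.set(fileSplit.getPath().getName());

        // Emit (word, filename) for every whitespace-separated token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, outvalue);
        }
    }
}
```

The unused `IntWritable one` field from the question is dropped here, since the value emitted is now the filename rather than a count.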

Wordcount common words of files
