How to use the WordCount MapReduce tutorial

Date: 2016-11-28 17:30:04

Tags: java hadoop mapreduce

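I am learning Hadoop MapReduce and following the WordCount tutorial.

In the code below, I understand that the map method processes one line at a time, as supplied by the specified TextInputFormat. It then splits the line into whitespace-separated tokens with StringTokenizer and emits a key-value pair <word, 1> for each token.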

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

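How can I edit this code so that it reads one sentence at a time instead of one line?

E.g., input text: This is my first sentence. This is the second sentence.

I want to read "This is my first sentence." first and then "This is the second sentence.", rather than This, is, my, first, ...

And output: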

1 This is my first sentence.
1 This is the second sentence.

because the sentence "This is my first sentence." appears only once in the input text, and the sentence "This is the second sentence." also appears once.

Suppose the input text is this:

This is my first sentence. This is my first sentence. This is the second sentence.

Then the output would be:

2 This is my first sentence.
1 This is the second sentence.

because the sentence "This is my first sentence." appears twice in the input text, while the sentence "This is the second sentence." appears only once.

FYI, the output of WordCount is:

2 This
2 is
1 my
1 first
2 sentence
1 second

because the term This appears twice in the input text, the term is appears twice, the term my appears once, and so on.
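For comparison, here is a minimal plain-Java sketch of what the tutorial's map and reduce steps do to that input. It runs without Hadoop, and the class and method names (WordCountSketch, countWords) are made up for illustration. One detail it makes visible: StringTokenizer splits only on whitespace, so a trailing period stays attached to its token and the key is "sentence." rather than "sentence".

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java stand-in for the WordCount map and reduce steps (no Hadoop involved).
// The names here are hypothetical and only serve the illustration.
final class WordCountSketch {

    // Tokenizes each line on whitespace, as the tutorial's map() does,
    // and adds 1 per token to that token's running total, as its reducer does.
    static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is the second sentence.";
        // Print in the "count key" layout used in this question.
        countWords(input).forEach((word, n) -> System.out.println(n + " " + word));
    }
}
```

Because every key is a single token, no change to the reducer alone can produce sentence counts; the record that reaches map() has to be a sentence in the first place.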

Solution: conf.set("textinputformat.record.delimiter", "."):

As the delimiter I set ". " (with a trailing space). Now my code recognizes sentences, but the output file is wrong. With the following input file:

This is my first sentence. This is my first sentence. This is the second sentence.

it produces an output file like this (some whitespace, then the number 3):

            3

instead of this:

 2 This is my first sentence
 1 This is the second sentence

Here is my code:

public class SentenceCount {

    public static class SentenceMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //System.out.println("SENTENCE: " + value.toString());
            context.write(word, one);
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ". ");
        Job job = Job.getInstance(conf, "sentence count");
        job.setJarByClass(SentenceCount.class);
        job.setMapperClass(SentenceMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Where did I go wrong?

2 answers:

Answer 0 (score: 1)

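The most straightforward solution is to preprocess your input so that each sentence is on its own line, and keep using TextInputFormat as is.

Alternatively, you can override TextInputFormat's default delimiter (the newline, \n).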

You can change the delimiter to . in the Driver class as follows:

conf.set("textinputformat.record.delimiter", ".");

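(But be careful: if a "." character appears inside a sentence (for example, "This pen costs 1.55 dollars."), or if a sentence ends with an exclamation mark instead of a period, you will get wrong results.)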
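To make that caveat concrete, here is a small plain-Java sketch that imitates splitting on a literal delimiter. It is not Hadoop's record reader: the class DelimiterSketch and its records method are hypothetical, and trimming and dropping blank pieces is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Imitates cutting text into records at a literal delimiter (illustration only, not Hadoop code).
final class DelimiterSketch {

    // Splits on the delimiter taken literally, trims each piece, and skips blank pieces.
    static List<String> records(String text, String delimiter) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split(Pattern.quote(delimiter))) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The decimal point cuts the first sentence in two, and "!" does not end a record.
        System.out.println(records("This pen costs 1.55 dollars. It is cheap!", "."));
        // With ". " as the delimiter, the final sentence keeps its period.
        System.out.println(records(
                "This is my first sentence. This is my first sentence. This is the second sentence.",
                ". "));
    }
}
```

The second call also shows a side effect of using ". " (period plus space): the last sentence is not followed by a space, so its period stays in the record while the earlier sentences lose theirs.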

Then, in your map() method, you no longer need to tokenize the sentence.

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
   context.write(value, one);
}
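This also appears to explain the output in the question. The mapper there calls context.write(word, one), but word is never set, so every record is emitted under the same empty key and the reducer sums them to 3. A minimal plain-Java simulation of that behavior follows; it does not use Hadoop, and the names (SentenceCountSketch, run) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Simulates "emit (key, 1) per record, then sum per key" without Hadoop (illustration only).
final class SentenceCountSketch {

    // keyOf plays the role of the mapper: it decides which key each record is counted under.
    static Map<String, Integer> run(List<String> records, UnaryOperator<String> keyOf) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            counts.merge(keyOf.apply(record), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "This is my first sentence",
                "This is my first sentence",
                "This is the second sentence");

        // Like context.write(word, one) with a word that was never set: one empty key.
        System.out.println(run(records, record -> ""));

        // Like context.write(value, one): each sentence is its own key.
        System.out.println(run(records, record -> record));
    }
}
```

In the real job the fix is the one shown above: write value instead of word. Trimming the record first may also be worth considering, so that stray whitespace or a trailing newline does not create distinct keys.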

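Answer 1 (score: 0)

Besides tokenizing on whitespace, you also need to tokenize on the sentence delimiter (in this case the period '.'), so using a regular expression may help.

Also, keep some corner cases in mind. For example, how do you want to treat the following: as two sentences or as three?

"This is my first sentence. This is my second sentence." Now I have a third sentence.

Should the double-quoted part be treated as one sentence or two (based on the '.')?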
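As a rough illustration of the regex idea, the sketch below splits after '.', '!' or '?' when whitespace follows. The rule and the names (RegexSentenceSplitter, split) are assumptions chosen for this example, not a complete sentence detector.

```java
import java.util.Arrays;
import java.util.List;

// Naive regex-based sentence splitting (illustrative rule only).
final class RegexSentenceSplitter {

    // Splits at whitespace that directly follows '.', '!' or '?'.
    // The lookbehind keeps the punctuation attached to the sentence before it.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(split("This is my first sentence. This is the second sentence."));

        // The quoted corner case from this answer.
        String quoted = "\"This is my first sentence. This is my second sentence.\" Now I have a third sentence.";
        for (String sentence : split(quoted)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

On the quoted example this rule yields two pieces: it splits inside the quotation, but not after the closing quote, because the character before that space is a quotation mark rather than a period. Whether that counts as correct depends on how you decide the corner case above.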