Question

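I am learning Hadoop MapReduce and following the WordCount tutorial.

In the code below, I understand that the map method processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into whitespace-separated tokens with a StringTokenizer and emits key-value pairs of the form [<word>, 1]: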

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

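How can I edit this code so that it reads one sentence at a time instead of one line?

E.g., input text: This is my first sentence. This is the second sentence.

I want to read This is my first sentence. first and then This is the second sentence., rather than This, is, my, first, ...

and output: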

1 This is my first sentence.
1 This is the second sentence.

because the sentence This is my first sentence. appears only once in the input text, and the sentence This is the second sentence. also appears once.

Suppose the input text is this:

This is my first sentence. This is my first sentence. This is the second sentence.

Then the output would be:

2 This is my first sentence.
1 This is the second sentence.

because the sentence This is my first sentence. appears twice in the input text, while the sentence This is the second sentence. appears only once.

FYI, the output of WordCount is:

2 This
2 is
1 my
1 first
2 sentence
1 second

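because the term This appears twice in the input text, the term is appears twice, the term my appears once, and so on.

Solution with conf.set("textinputformat.record.delimiter", "."):

As the delimiter I set ". " (with a trailing space). My code now recognizes sentences, but the output file is wrong. With the following input file: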

This is my first sentence. This is my first sentence. This is the second sentence.

it produces an output file like this (some whitespace, then the number 3):

instead of this:

 2 This is my first sentence
 1 This is the second sentence

This is my code:

 public class SentenceCount {

      public static class SentenceMapper extends Mapper<Object, Text, Text, IntWritable>{

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
           //System.out.println("SENTENCE: " + value.toString());
           context.write(word, one);
     }
 }


 public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
      private IntWritable result = new IntWritable();

     public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
               sum += val.get();
       }
       result.set(sum);
       context.write(key, result);
     }
 }

 public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("textinputformat.record.delimiter", ". ");
      Job job = Job.getInstance(conf, "sentence count");
      job.setJarByClass(SentenceCount.class);
      job.setMapperClass(SentenceMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
 }

Where did I go wrong?

Answer 1

The most straightforward solution is to preprocess your input so that each sentence sits on its own line, and keep using TextInputFormat as it is.
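The preprocessing step could look roughly like the sketch below. It is plain Java with no Hadoop dependency, and it assumes English text and that the JDK's java.text.BreakIterator is an acceptable sentence detector for your data. The class and method names (SentencePerLine, splitSentences, toLines) are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencePerLine {

    // Splits text into trimmed, non-empty sentences using the JDK sentence iterator.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Joins the sentences with newlines so that TextInputFormat sees one sentence per record.
    static String toLines(String text) {
        return String.join("\n", splitSentences(text));
    }

    public static void main(String[] args) {
        System.out.println(toLines("This is my first sentence. This is the second sentence."));
    }
}
```

You would run something like this over the input files before submitting the job, then leave the WordCount driver configuration untouched.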

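Alternatively, you can override TextInputFormat's default delimiter (the newline character, \n).

You can change the delimiter to . like this:

conf.set("textinputformat.record.delimiter", ".") - in the Driver class.

(But be careful: if a "." character appears inside a sentence (for example, "This pen costs 1.55 dollars."), or if a sentence ends with an exclamation mark instead of a period, you will get wrong results.)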
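To make the caveat concrete, here is a small plain-Java sketch that mimics splitting on a literal delimiter. It does not call Hadoop; String.split stands in for the record reader, and DelimiterPitfall and countRecords are hypothetical names.

```java
import java.util.regex.Pattern;

public class DelimiterPitfall {

    // Counts the non-empty records produced by splitting on a literal delimiter.
    static int countRecords(String text, String delimiter) {
        int count = 0;
        for (String record : text.split(Pattern.quote(delimiter))) {
            if (!record.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // One sentence, but the decimal point splits it into two records.
        System.out.println(countRecords("This pen costs 1.55 dollars.", "."));
        // Two sentences, but "!" is not the delimiter, so only one record comes out.
        System.out.println(countRecords("Stop! Go now.", "."));
    }
}
```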

Then, in your map() method, you no longer need to tokenize the sentence.

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
   context.write(value, one);
}
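Note that the mapper in the question writes word, which is never set, so every record is emitted under the same empty key. That is consistent with the single blank line carrying the count 3. Writing value, as above, addresses this. One more thing to watch with the ". " delimiter: the last record may still carry its trailing period (and a newline), so it can help to normalize the key before emitting it. The plain-Java simulation below illustrates the idea without Hadoop. SentenceCountSketch, count and normalize are hypothetical names, and String.split only approximates what the record reader does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SentenceCountSketch {

    // Splits the text on a literal delimiter and counts each normalized record.
    static Map<String, Integer> count(String text, String delimiter) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String record : text.split(Pattern.quote(delimiter))) {
            String sentence = normalize(record);
            if (!sentence.isEmpty()) {
                counts.merge(sentence, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Trims whitespace and strips trailing periods so identical sentences share one key.
    static String normalize(String record) {
        String s = record.trim();
        while (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        return s;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence. "
                + "This is the second sentence.\n";
        // Prints "2 This is my first sentence" and then "1 This is the second sentence".
        count(input, ". ").forEach((sentence, n) -> System.out.println(n + " " + sentence));
    }
}
```

In the real job, the same normalization would go inside map(), for example by setting word to the normalized value.toString() and writing word only when it is non-empty.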

Answer 2

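Besides tokenizing on whitespace, you also need to tokenize on the sentence delimiter (in this case the period, '.'). Using a regex may help.

Also, keep some edge cases in mind. For example, how do you want to treat the following? As two sentences or three?

"This is my first sentence. This is my second sentence." Now I have a third sentence.

Should the double-quoted part count as one sentence or two (that is, do you split on the " or on the .)?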
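As a rough illustration of the regex idea, the sketch below takes one possible position on that edge case: it splits after ., ! or ?, optionally followed by a closing double quote, whenever whitespace follows. Under that rule the example above comes out as three sentences. The pattern and the names QuoteAwareSplit and split are assumptions made for illustration, not a complete sentence detector.

```java
import java.util.Arrays;
import java.util.List;

public class QuoteAwareSplit {

    // Boundary: whitespace preceded by . ! or ?, optionally followed by a closing double quote.
    private static final String BOUNDARY = "(?<=[.!?]\"?)\\s+";

    // Splits the text at every boundary; the whitespace itself is dropped.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split(BOUNDARY));
    }

    public static void main(String[] args) {
        String text = "\"This is my first sentence. This is my second sentence.\" "
                + "Now I have a third sentence.";
        split(text).forEach(System.out::println);
    }
}
```

If you would rather treat the whole quoted part as one sentence, the rule needs to track whether a quote is open, which a single split pattern does not do well.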
