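I'm learning Hadoop MapReduce and following the WordCount tutorial.
In the code below, I understand that the map method processes one line at a time, as supplied by the specified TextInputFormat. It then splits the line into whitespace-separated tokens with StringTokenizer and emits a key-value pair [<word>, 1] for each token: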
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
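For reference, the tokenization step can be tried outside Hadoop. The following is a minimal standalone sketch; the class name TokenizeDemo and the helper tokenize are made up for illustration and are not part of the tutorial. It shows that StringTokenizer splits on whitespace only, so punctuation stays attached to the preceding word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeDemo {
    // Splits a line on whitespace, the same way the tutorial's map() does.
    // (Hypothetical helper for illustration; not part of the tutorial.)
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token would become one [<word>, 1] pair in the real mapper.
        for (String token : tokenize("This is my first sentence. This is the second sentence.")) {
            System.out.println(token + "\t1");
        }
    }
}
```

Note that the fifth token is "sentence." with the period still attached, because only whitespace separates tokens.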
How can I edit this code so that it reads one sentence at a time instead of one line?
E.g., for the input text:
This is my first sentence. This is the second sentence.
I want to read "This is my first sentence." first,
then "This is the second sentence.",
rather than "This", "is", "my", "first", ...
and to output:
1 This is my first sentence.
1 This is the second sentence.
because the sentence "This is my first sentence." appears only once in the input text, and the sentence "This is the second sentence." also appears once.
Suppose the input text is this:
This is my first sentence. This is my first sentence. This is the second sentence.
Then the output would be:
2 This is my first sentence.
1 This is the second sentence.
because the sentence "This is my first sentence." appears twice in the input text, while the sentence "This is the second sentence." appears only once.
FYI, the WordCount output is:
2 This
2 is
1 my
1 first
2 sentence
1 second
because the term "This" appears twice in the input text, the term "is" appears twice, the term "my" appears once, and so on.
Solution: conf.set("textinputformat.record.delimiter", "."):
As the delimiter, I set ". " (with a trailing space). My code now recognizes sentences, but the output file is wrong. With the following input file:
This is my first sentence. This is my first sentence. This is the second sentence.
it produces an output file like this (some whitespace, then the number 3):
3
instead of this:
2 This is my first sentence
1 This is the second sentence
Here is my code:
public class SentenceCount {

    public static class SentenceMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //System.out.println("SENTENCE: " + value.toString());
            context.write(word, one);
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ". ");
        Job job = Job.getInstance(conf, "sentence count");
        job.setJarByClass(SentenceCount.class);
        job.setMapperClass(SentenceMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Where did I go wrong?
Answer 0 (score: 1)
The most straightforward solution is to preprocess your input so that each sentence is on its own line, and keep using TextInputFormat as is.
Alternatively, you can override the default record delimiter of TextInputFormat (a newline, \n). In the Driver class, change the delimiter to "." like this:
conf.set("textinputformat.record.delimiter", ".");
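(Be careful, though: if a "." character occurs inside a sentence, e.g. "This pen costs 1.55 dollars.", or a sentence ends with an exclamation mark instead of a period, you will get wrong results.)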
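To make the caveat concrete, here is a small plain-Java sketch that approximates delimiter-based record splitting with String.split. It does not use Hadoop; the class name DelimiterPitfall and the helper records are invented for illustration, and the real record reader may differ in details such as how the final record is handled.

```java
import java.util.regex.Pattern;

class DelimiterPitfall {
    // Approximates splitting input into records on a literal delimiter.
    // Pattern.quote makes "." a literal period instead of a regex wildcard.
    static String[] records(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String text = "This pen costs 1.55 dollars. It is cheap!";
        // The decimal point is treated as a sentence boundary,
        // and the "!" is not treated as one at all.
        for (String record : records(text, ".")) {
            System.out.println("[" + record + "]");
        }
    }
}
```

With this input the price is cut in two ("This pen costs 1" and "55 dollars"), while the sentence ending in "!" is never terminated by the delimiter.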
Then, in your map() method, you no longer need to tokenize the sentence; emit the record itself (value) as the key:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    context.write(value, one);
}
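The difference between writing value and writing a Text field that was never set can be simulated without a cluster. The sketch below is a rough plain-Java approximation, not Hadoop code: the class SentenceCountSketch, the method count, and the flag emitRecordAsKey are hypothetical names, and String.split stands in for the record reader.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class SentenceCountSketch {
    // Splits the input into records on a literal delimiter (map phase),
    // then sums the 1s emitted for each key (reduce phase).
    static Map<String, Integer> count(String input, String delimiter, boolean emitRecordAsKey) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : input.split(Pattern.quote(delimiter))) {
            // true  mirrors context.write(value, one);
            // false mirrors writing a key that was never set, i.e. an empty key.
            String key = emitRecordAsKey ? record : "";
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence. This is the second sentence.";
        System.out.println(count(input, ". ", true));   // one entry per distinct sentence
        System.out.println(count(input, ". ", false));  // everything collapses onto the empty key
    }
}
```

In this sketch the first call counts the first sentence twice and the second sentence once, while the second call produces a single empty key with a count of 3, which resembles the output described in the question. Also note that with the delimiter ". " the last record keeps its trailing period, so the same sentence may be counted under two different keys depending on where it occurs.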
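Answer 1 (score: 0)
Instead of tokenizing on whitespace, you need to tokenize on the sentence delimiter (in this case the period, '.'). A regex may help here.
Also, keep some corner cases in mind. For example, how would you treat the following: as two sentences or three?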
"This is my first sentence. This is my second sentence." Now I have a third sentence.
Is the double-quoted part treated as one sentence or two (depending on whether you split on " or on .)?
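As a rough illustration of the regex idea, the sketch below splits on whitespace that follows '.', '!' or '?'. The boundary rule, the class name RegexSentenceSplit, and the method sentences are assumptions chosen for this example, not a general sentence detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexSentenceSplit {
    // Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    // The lookbehind keeps the punctuation attached to the sentence.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+");

    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String s : BOUNDARY.split(text.trim())) {
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "1.55" is not split because no whitespace follows the decimal point.
        System.out.println(sentences("This pen costs 1.55 dollars. Is it cheap? Yes!"));
        // A period followed by a closing quote is not a boundary under this rule.
        System.out.println(sentences(
            "\"This is my first sentence. This is my second sentence.\" Now I have a third sentence."));
    }
}
```

Under this rule the quoted example yields two pieces rather than three, because the second period is followed by a quotation mark instead of whitespace. Whether that is the desired behavior is exactly the kind of corner case that has to be decided up front.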