Removing whole sentences that contain a specific word with MapReduce

Asked: 2016-11-29 15:39:28

Tags: java hadoop mapreduce

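I am learning MapReduce, and I want to read an input file sentence by sentence and write each sentence to the output file only if it does not contain the word "snake".

E.g., the input file: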

This is my first sentence. This is my first sentence.
This is my first sentence.

The snake is an animal. This is the second sentence. This is my third sentence.

Another sentence. Another sentence with snake.

The output file should then be:

This is my first sentence. This is my first sentence.
This is my first sentence.

This is the second sentence. This is my third sentence.

Another sentence.

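To do this, I check in the map method whether the sentence (value) contains the word snake. If it does not, I write the sentence to the context.

I have also set the number of reducer tasks to 0, because otherwise the sentences appear in the output file in random order (e.g. the first sentence, then the third, then the second, and so on).

My code correctly filters out the sentences containing the word snake, but the problem is that it writes every sentence on a new line, like this: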

This is my first sentence. 
 This is my first sentence. 

This is my first sentence. 
 This is the second sentence. 
 This is my third sentence. 


Another sentence. 

. 

How can I write a sentence on a new line only when that sentence starts on a new line in the input text? Here is my code:

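import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {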

    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable>{

        private Text removeWord = new Text ("snake");

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            if (!value.toString().contains(removeWord.toString())) {
                Text currentSentence = new Text(value.toString()+". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }


    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ".");

        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setMapperClass(SentenceMapper.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This and this other solution say that it should be enough to call context.write(word, null);, but that does not work in my case.

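Another problem is related to conf.set("textinputformat.record.delimiter", ".");. This is how I define the delimiter between sentences, and because of it a sentence in the output file sometimes starts with a space (e.g. the second This is my first sentence.). As an alternative, I tried setting it to conf.set("textinputformat.record.delimiter", ". "); (with a space after the full stop), but then the Java application does not write all the sentences to the output file.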

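1 Answer:

Answer 0 (score: 0)

You are very close to solving the problem. Think about how your MapReduce program works. Your map method receives each sentence, delimited by "." (the default delimiter, as you know, is a newline), as a new value and then writes it to the file, and every written record is followed by a newline. You would need a property that disables writing a newline after each map() call. I am not sure, but I don't think such a property exists.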
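To make that behaviour concrete, here is a minimal plain-Java sketch that imitates the job without Hadoop. It is only an approximation of what the record reader and the text output do, not Hadoop's actual implementation, and the names DelimiterDemo and simulate are made up for this illustration. It cuts the input at every ".", treats each piece as one record, and ends every kept record with a newline, as the mapper's output does.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterDemo {

    // Hypothetical stand-in for the job: split the input at every ".",
    // drop records containing the word, and append ". " as the mapper does.
    public static List<String> simulate(String input, String word) {
        List<String> outputLines = new ArrayList<>();
        for (String record : input.split("\\.")) {
            if (!record.contains(word)) {
                outputLines.add(record + ". ");
            }
        }
        return outputLines;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence.\n"
                + "This is my first sentence.\n";
        // Each record ends up on its own output line, mirroring the
        // newline that is written after every record.
        for (String line : simulate(input, "snake")) {
            System.out.print(line + "\n");
        }
    }
}
```

Note that the original line breaks and the spaces after each full stop stay inside the records. That would account for the output sentences that start with a space, the extra blank lines, and a stray ". " line coming from the trailing newline of the input.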

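One workaround is to let the input be processed normally, i.e. line by line with the default delimiter. An example record would be: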

This is first sentence. This is second snake. This is last.

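Look for the word "snake"; if it is found, remove everything from just after the previous "." up to and including the next ".". Pack the result into a new String and write it to the context.

Of course, if you can find a way to disable the newline after a map() call, that would be the simplest solution.

Hope this helps.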
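The workaround above could be sketched in plain Java roughly as follows. This is an illustration under some assumptions, not a tested Hadoop job: the class SentenceFilter and the method removeSentencesContaining are hypothetical names, sentences are assumed to end with "." and not to span lines, and the match is a simple substring test (so "snakes" would also match), as in the code from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceFilter {

    // Hypothetical helper: drops every "."-terminated sentence in one line
    // that contains the given word and keeps the others on the same line.
    public static String removeSentencesContaining(String line, String word) {
        List<String> kept = new ArrayList<>();
        // Split right after each full stop, consuming any whitespace after it.
        for (String sentence : line.split("(?<=\\.)\\s*")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty() && !trimmed.contains(word)) {
                kept.add(trimmed);
            }
        }
        // Re-join the surviving sentences with a single space.
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String record = "This is first sentence. This is second snake. This is last.";
        // Prints: This is first sentence. This is last.
        System.out.println(removeSentencesContaining(record, "snake"));
    }
}
```

In the mapper from the question, the idea would be to remove the textinputformat.record.delimiter setting so that each value is one input line, and then write new Text(removeSentencesContaining(value.toString(), "snake")) to the context. Because one output record then corresponds to one input line, the original line breaks are kept.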