Running jobs sequentially in Hadoop

Time: 2014-04-11 12:26:44

Tags: java hadoop mapreduce

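I am using ChainMapper, and what happens is that the mappers run in a pipelined fashion. What I want is for each mapper to wait until the previous mapper has completely finished its work.

Take the word count example, where the first mapper splits lines into words, the second mapper counts uppercase words, and the third mapper counts lowercase words.

Run this way, the second mapper can start counting words even though the first mapper has not yet finished reading the whole file.

What I want is to force the second mapper to wait until the first mapper has read the entire input file, and only then start.

My current configuration is as follows: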

        JobConf conf = new JobConf(getConf(), ChainDriver.class);
        conf.setJobName("wordcount");

        ....

        // first mapper
        JobConf mapAConf = new JobConf(false);
        ChainMapper.addMapper(conf, TokenizerMapper.class, LongWritable.class,
                Text.class, Text.class, IntWritable.class, true, mapAConf);

        // second mapper
        JobConf mapBConf = new JobConf(false);
        ChainMapper.addMapper(conf, UpperCaserMapper.class, Text.class,
                IntWritable.class, Text.class, IntWritable.class, true,
                mapBConf);
        .....

        // reducer
        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(conf, WordCountReducer.class, Text.class,
                IntWritable.class, Text.class, IntWritable.class, true,
                reduceConf);
        JobClient.runJob(conf);
        return 0;
    }

Is there a way to force the mappers to run strictly one after another?

1 answer:

Answer 0 (score: 0)

ChainMapper does run the mappers in sequence. Read this line from the documentation:

The Mapper classes are invoked in a chained (or piped) fashion, the output of the first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.

You can find it here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapred/lib/ChainMapper.html
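
To make the difference between "chained (or piped)" and "one mapper finishes before the next starts" concrete, here is a minimal plain-Java sketch. It is not Hadoop code, only a simulation under stated assumptions: the class and method names (ChainOrderDemo, splitLine, pipelined, staged) are made up for illustration, and the two steps are stand-ins for a line splitter and a word counter. pipelined hands each record to the second step as soon as the first step emits it, which is roughly how the chain behaves inside a single map task; staged finishes the first step on the whole input before the second step sees anything, which is the behaviour asked for in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
public class ChainOrderDemo {

    // Stand-in for the first mapper: split one line into words.
    static List<String> splitLine(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Piped order: every word emitted by step one goes straight to step two,
    // before step one has seen the next line.
    static List<String> pipelined(List<String> lines) {
        List<String> log = new ArrayList<>();
        for (String line : lines) {
            log.add("split:" + line);
            for (String word : splitLine(line)) {
                log.add("count:" + word); // stand-in for the second mapper
            }
        }
        return log;
    }

    // Staged order: step one consumes the whole input first, and step two
    // starts only after that (a barrier between the two steps).
    static List<String> staged(List<String> lines) {
        List<String> log = new ArrayList<>();
        List<String> intermediate = new ArrayList<>();
        for (String line : lines) {
            log.add("split:" + line);
            intermediate.addAll(splitLine(line));
        }
        for (String word : intermediate) {
            log.add("count:" + word); // stand-in for the second mapper
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello world", "Foo bar");
        System.out.println("pipelined: " + pipelined(input));
        System.out.println("staged:    " + staged(input));
    }
}
```

With the two sample lines, the pipelined log interleaves split and count entries, while the staged log lists both split entries before any count entry. In Hadoop itself, one common way to get the staged behaviour is to split the work into separate jobs and run them one after another: JobClient.runJob returns only after the submitted job has completed, so a second job that reads the first job's output directory cannot start early.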