Hadoop: reducing data from the mapper to the combiner

Time: 2016-02-12 08:30:38

Tags: java hadoop

I have a text input file in which each line contains a URL followed by a variable number of keywords. It looks like this:

  1. facebook.com social news friends
  2. msn.com news mail
  3. yahoo.com finance news

I need to transform this into output such as:

    1. social facebook.com
    2. news facebook.com msn.com yahoo.com
    3. friends facebook.com
    4. finance yahoo.com

My mapper class looks like this:

      public class KeywordsMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Text urlkey = new Text();
      @Override
      protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
          String[] line = value.toString().split(" ");
          ArrayList<String> keywords = new ArrayList<String>();
          for (String sequence : line) {
              if (sequence.endsWith(".com")) {
                  // url
                  urlkey.set(sequence);
              } else {
                  // keyword
                  keywords.add(sequence);
              }
          }
          for (String keyword : keywords) {
              context.write(new Text(keyword), urlkey);
          }
      }
      }
      

My reducer/combiner class looks like this:

      public class KeywordReducer extends Reducer<Text, Iterable<Text>, Text, Text> {
      public void reduce(Text key,  Iterable<Text> values, Context context) throws IOException, InterruptedException {
          String body = "";
          for(Text part : values){
              body = body + " " + part.toString() + " ";
          }
          context.write(key, new Text(body));
      }
      }
      

The job looks like this:

      public class KeywordJob extends Configured implements Tool{
      
      @Override
      public int run(String[] arg0) throws Exception {
          Job job = new Job(getConf());
          job.setJarByClass(getClass());
          job.setJobName(getClass().getSimpleName());
      
          FileInputFormat.addInputPath(job, new Path(arg0[0]));
          FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
      
          job.setMapperClass(KeywordsMapper.class);
          job.setCombinerClass(KeywordReducer.class);
          job.setReducerClass(KeywordReducer.class);
      
      
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
      
          return job.waitForCompletion(true) ? 0 : 1;
      }
      
      public static void main(String[]args) throws Exception{
          int rc = ToolRunner.run(new KeywordJob(), args);
          System.exit(rc);
      
      }
      
      }
      

The output I currently get is:

[image: output]

The input file is:

      yahoo.com news sports finance email celebrity
      amazon.com shoes books jeans
      google.com news finance email search
      microsoft.com operating-system productivity search
      target.com shoes books jeans groceries
      wegmans.com books groceries
      facebook.com news social sports
      linkedin.com news recruitment
      

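Question: how do I adjust my combiner/reducer to get the desired output? And is there a specific reason why the output contains multiple duplicate keys whose values have not been merged?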

1 answer:

Answer 0 (score: 2)

Mark,

Your reducer is not being called/invoked.

The reducer class definition should look like -

public class KeywordReducer extends Reducer<Text, Text, Text, Text>

instead of

public class KeywordReducer extends Reducer<Text, Iterable<Text>, Text, Text>

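because the reducer's input types must correspond to the map output types (Text, Text). The reduce() method signature is correct. When the second type parameter is Iterable<Text>, however, that method no longer overrides Reducer.reduce(), so Hadoop runs the default identity implementation and writes every (keyword, URL) pair unmerged, which is why the keys appear duplicated.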
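The overload-versus-override effect can be reproduced without Hadoop. The following is only a minimal sketch in plain Java: `BaseReducer`, `WrongReducer`, `FixedReducer` and `runRaw` are hypothetical stand-ins invented for illustration, not Hadoop classes, and the identity default is modelled on what `Reducer.reduce()` does when it is not overridden.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverrideDemo {

    // Hypothetical stand-in for Hadoop's Reducer: run() always calls
    // reduce(K, Iterable<V>, ...), whose default is an identity pass-through.
    static class BaseReducer<K, V> {
        protected void reduce(K key, Iterable<V> values, List<String> out) {
            for (V value : values) {
                out.add(key + "\t" + value);
            }
        }

        public List<String> run(K key, Iterable<V> values) {
            List<String> out = new ArrayList<>();
            reduce(key, values, out);
            return out;
        }
    }

    // V is declared as Iterable<String>, so the method below is only an
    // overload of reduce(); run() never reaches it.
    static class WrongReducer extends BaseReducer<String, Iterable<String>> {
        public void reduce(String key, Iterable<String> values, List<String> out) {
            out.add(key + "\t" + String.join(" ", values));
        }
    }

    // V is String, so this is a real override (and @Override compiles).
    static class FixedReducer extends BaseReducer<String, String> {
        @Override
        protected void reduce(String key, Iterable<String> values, List<String> out) {
            out.add(key + "\t" + String.join(" ", values));
        }
    }

    // The framework hands values over without knowing the declared generics;
    // raw types imitate that here (an assumption of this sketch).
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List<String> runRaw(BaseReducer reducer, String key, List<String> values) {
        return reducer.run(key, values);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("facebook.com", "msn.com");
        System.out.println(runRaw(new WrongReducer(), "news", urls)); // one record per URL
        System.out.println(runRaw(new FixedReducer(), "news", urls)); // one merged record
    }
}
```

Annotating the reduce() method in the question with @Override would likely have surfaced the problem at compile time, since the compiler rejects @Override on a method that overrides nothing.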

Hope this helps.
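To check what the job should emit once the reducer is actually invoked, the map and reduce logic from the question can be simulated on the small three-line example without a cluster. This is a rough sketch in plain Java with no Hadoop dependency: `KeywordIndexSketch`, `invert` and `reduce` are hypothetical names, the `TreeMap` only imitates the sorted grouping of the shuffle phase, and the order of URLs within a key follows the input here, whereas a real job does not guarantee value order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeywordIndexSketch {

    // Mirrors the mapper: the token ending in ".com" is the URL, every other
    // token is a keyword. Grouping by keyword stands in for the shuffle.
    static Map<String, List<String>> invert(List<String> lines) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String line : lines) {
            String url = null;
            List<String> keywords = new ArrayList<>();
            for (String token : line.trim().split("\\s+")) {
                if (token.endsWith(".com")) {
                    url = token;
                } else {
                    keywords.add(token);
                }
            }
            if (url == null) {
                continue; // skip lines without a URL
            }
            for (String keyword : keywords) {
                index.computeIfAbsent(keyword, k -> new ArrayList<>()).add(url);
            }
        }
        return index;
    }

    // Mirrors the intended reducer: one output line per keyword.
    static String reduce(String keyword, List<String> urls) {
        return keyword + "\t" + String.join(" ", urls);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "facebook.com social news friends",
                "msn.com news mail",
                "yahoo.com finance news");
        for (Map.Entry<String, List<String>> entry : invert(input).entrySet()) {
            System.out.println(reduce(entry.getKey(), entry.getValue()));
        }
    }
}
```

Joining with a single separator rather than padding each value with spaces on both sides keeps the output stable if the same class is also used as the combiner, because an already joined value then passes through a second join unchanged apart from the separators between values.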