I am writing MapReduce code for a use case where I have to group the values that correspond to the same key:
INPUT:
A,B
A,C
B,A
B,D
OUTPUT:
A {B,C}
B {A,D}
I wrote the following:
import java.io.IOException;
import java.util.StringTokenizer;

/*
 * All org.apache.hadoop packages can be imported using the jar present in the lib
 * directory of this java project.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GroupKeyValues {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            Text myKey = new Text();
            Text myVal = new Text();
            String line = value.toString();
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                String thisH = st.nextToken();
                String[] splitData = thisH.split(",");
                myKey.set(splitData[0]);
                myVal.set(splitData[1]);
            }
            con.write(myKey, myVal);
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "GroupKeyValues");
        job.setJarByClass(GroupKeyValues.class);
        job.setMapperClass(Map.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        outputPath.getFileSystem(conf).delete(outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Answer 0 (score: 1)
You are missing the reducer that merges the values into a single "row" value. For example, you can use ArrayWritable like this:
// Additional imports needed: java.util.ArrayList, java.util.List,
// org.apache.hadoop.io.ArrayWritable, org.apache.hadoop.mapreduce.Reducer
public static class AggregatingReducer extends Reducer<Text, Text, Text, ArrayWritable> {

    private ArrayWritable result = new ArrayWritable(Text.class);

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> list = new ArrayList<>();
        for (Text value : values) {
            // copy the value: Hadoop reuses the same Text instance across iterations
            list.add(new Text(value));
        }
        result.set(list.toArray(new Text[list.size()]));
        context.write(key, result);
    }
}
In the job setup, be sure to add the following:
job.setReducerClass(AggregatingReducer.class);
job.setOutputValueClass(ArrayWritable.class); //instead of Text.class
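One caveat: ArrayWritable does not override toString(), so if the job writes through TextOutputFormat the emitted value may not be printed in a readable form. A common workaround is to subclass it and format the output yourself; a minimal sketch (the class name TextArrayWritable is my own, not part of the answer above):

// Hypothetical helper: makes the array print as a comma-separated list
// when rendered by TextOutputFormat.
public static class TextArrayWritable extends ArrayWritable {

    public TextArrayWritable() {
        super(Text.class);
    }

    @Override
    public String toString() {
        // toStrings() returns the stored elements as plain Strings
        return String.join(",", super.toStrings());
    }
}

The reducer and the job's output value class would then use TextArrayWritable instead of ArrayWritable.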
Alternatively (depending on what you need), you can concatenate the reducer values with a StringBuilder and emit a Text, instead of accumulating them into an ArrayWritable and emitting that.
UPDATE: Here is an example that uses a StringBuilder with a comma delimiter:
public static class AggregatingReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() != 0) {
                sb.append(',');
            }
            sb.append(value);
        }
        result.set(sb.toString());
        context.write(key, result);
    }
}
In the driver, the value type needs to be changed back to Text:
job.setOutputValueClass(Text.class);
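Putting the two driver changes together with the question's existing setup, the relevant part of main() would look roughly like this (a sketch; only the reducer registration and the output value class change relative to the original):

job.setJarByClass(GroupKeyValues.class);
job.setMapperClass(Map.class);
job.setReducerClass(AggregatingReducer.class);  // new: groups values per key
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);            // Text for the StringBuilder variant
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);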
Answer 1 (score: 0)
Have you considered using Apache Spark to solve this problem? The code could look something like this:
import org.apache.spark.sql.functions._
val df = sqlContext.createDataFrame(Seq(("A","B"),("A","C"),("B","A"),("B","D")))
val dfAgg = df.groupBy("_1").agg(collect_list("_2"))
dfAgg.show()
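With the sample input, this should produce one row per key, e.g. A with the collected values [B, C] and B with [A, D]. Note that collect_list does not guarantee the order of elements within each group.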