Question

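My question is how to sort the output of the mappers in a MapReduce program (PS: there are no reducers, i.e. 0). I only use the map side to filter two inputs, and I want the output of each mapper to be sorted by key. How can I do this within the same job, without running a second job? Any suggestions are welcome.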

Answer 1

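You can get a partial (per-Mapper) sort by collecting all of the intended results in a local, in-memory data structure on the Mapper. You then sort that structure and, finally, emit every element of the now-sorted collection (collector.write, i.e. context.write in the newer org.apache.hadoop.mapreduce API). This assumes each Mapper's output fits in memory.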

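The difference from the vanilla behavior is that, in the vanilla case, each element is simply emitted as it is encountered, which results in random/unordered output.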

Note that the result still has no total ordering across Mappers: that would require a Reducer step.
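To make the buffer-then-sort idea concrete, here is a minimal, Hadoop-free sketch of the pattern. The names `SortedEmitSketch`, `BufferingMapper` and `runMapper` are hypothetical and only model a map task. In a real job the buffering would presumably go in `map()`, the sort-and-emit loop in the Mapper's `cleanup(Context)` method, and the job would be configured with `job.setNumReduceTasks(0)`. The empty-string filter is just a placeholder for your own filter logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, Hadoop-free model of a map task that sorts its own output.
class SortedEmitSketch {

    static class BufferingMapper {
        // Holds every record this mapper wants to emit; it must fit in memory.
        private final List<String> buffer = new ArrayList<>();

        // Plays the role of map(): filter, then buffer instead of writing.
        void map(String record) {
            if (!record.isEmpty()) {      // placeholder filter logic
                buffer.add(record);
            }
        }

        // Plays the role of cleanup(): sort once, then emit in key order.
        void cleanup(Consumer<String> out) {
            Collections.sort(buffer);     // a List keeps duplicate keys
            buffer.forEach(out);
        }
    }

    // Runs one simulated mapper over its input split and returns its output.
    static List<String> runMapper(List<String> split) {
        BufferingMapper mapper = new BufferingMapper();
        split.forEach(mapper::map);
        List<String> output = new ArrayList<>();
        mapper.cleanup(output::add);
        return output;
    }

    public static void main(String[] args) {
        // prints [apple, apple, fig, pear]
        System.out.println(runMapper(List.of("pear", "apple", "", "fig", "apple")));
    }
}
```

A plain list plus one sort at the end is used instead of a sorted map so that duplicate keys are kept. Each simulated mapper only sorts its own split, so concatenating the outputs of several mappers is still not globally sorted.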

Answer 2

Three files in one directory, sorted by one mapper in a single job. Run as hadoop jar sort.jar sort file:///path/sortFiles/ sortedFiles

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class sort{
    public static class sortMapper extends Mapper<Object, Text, Text, Text> {
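        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // Add filter logic here. Each input line is emitted as the key with an
            // empty value, so the ordering is by the whole line.
            context.write(new Text(value), new Text(""));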
        }
    }

    public static void main(String[] args) throws Exception {

        if (args.length != 2)
        {
            System.err.println("missing args: usage <prog> <input path> <output path>");
            System.exit(1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sort multiple files");
        job.setJarByClass(sort.class);
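        job.setMapperClass(sortMapper.class);
        // No reducer class or reducer count is set, so Hadoop's defaults apply
        // (a single identity reducer); the shuffle feeding it is what sorts the keys.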
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); 

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
}

How to sort the output of a map-side program in MapReduce?
