How to override Hadoop's default sort order

Date: 2015-03-06 12:28:43

Tags: java hadoop mapreduce

I have a map-reduce job where the keys are numbers from 1 to 200. My expected output is (number, value) pairs in numerical order, but I am getting the output as:

1    value
10   value
11   value
   :
   : 
2    value
20   value
   :
   :
3    value

I know this is due to the default Map-Reduce behaviour of sorting keys in ascending order.

I want my keys to be sorted numerically. How can I achieve this?

2 answers:

Answer 0 (score: 3)

If I had to guess, I'd say you are storing your numbers as Text objects rather than IntWritable objects.

Either way, once you have more than one reducer, only the items within each reducer will be sorted; the output as a whole will not be totally sorted.
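Not part of the original answer, but a minimal standalone sketch of the Text-vs-IntWritable point (the class name KeyOrderingDemo is made up for illustration): Text compares raw bytes lexicographically, so "10" sorts before "2", while IntWritable compares the numeric values.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class KeyOrderingDemo {
    public static void main(String[] args) {
        // Lexicographic byte comparison: "10" < "2" because '1' < '2'
        System.out.println(new Text("10").compareTo(new Text("2")));            // negative
        // Numeric comparison: 10 > 2
        System.out.println(new IntWritable(10).compareTo(new IntWritable(2)));  // positive
    }
}

If you also need the output to be totally ordered across reducers, the simplest option is a single reducer (job.setNumReduceTasks(1), as in the driver in the answer below), at the cost of parallelism.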

Answer 1 (score: 1)

The default WritableComparator in the MapReduce framework would normally handle your numerical ordering if the key were an IntWritable. I suspect it is receiving a Text key, which results in the lexicographical ordering you are seeing. Please have a look at the sample code below, which emits values with an IntWritable key:

1) Mapper implementation

package com.stackoverflow.answers.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SourceFileMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final String DEFAULT_DELIMITER = "\t";

    private IntWritable keyToEmit = new IntWritable();
    private Text valueToEmit = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Parse a tab-delimited line: field 0 is the numeric key, field 1 is the value
        String[] fields = value.toString().split(DEFAULT_DELIMITER);
        keyToEmit.set(Integer.parseInt(fields[0]));
        valueToEmit.set(fields[1]);
        context.write(keyToEmit, valueToEmit);
    }

}
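For reference (an assumption based on DEFAULT_DELIMITER above, not spelled out in the original answer), this mapper expects tab-delimited input lines of the form number&lt;TAB&gt;value, for example:

1	valueA
2	valueB
10	valueC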

2) Reducer implementation

package com.stackoverflow.answers.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SourceFileReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException,
            InterruptedException {
        // Keys arrive already sorted numerically by the framework; emit each value unchanged
        for (Text value : values) {
            context.write(key, value);
        }
    }

}

3) Driver implementation

package com.stackoverflow.answers.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SourceFileDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = Job.getInstance(conf, "SourceFileDriver");
        job.setJarByClass(SourceFileDriver.class);

        // Setup MapReduce
        job.setMapperClass(SourceFileMapper.class);
        job.setReducerClass(SourceFileReducer.class);
        // A single reducer keeps the entire output in one globally sorted file
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);

    }

}
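A possible way to run the job once it is packaged (the jar name and HDFS paths below are placeholders, not from the original answer):

hadoop jar sourcefile-example.jar com.stackoverflow.answers.mapreduce.SourceFileDriver /user/hadoop/input /user/hadoop/output

With an IntWritable key and a single reducer, the output file should list the keys in numerical order: 1, 2, 3, ..., 200.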

Thanks!