Question

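I need to process a huge file with MapReduce, and I need to let end users choose how many records they want to process.

The problem is that there doesn't seem to be any efficient way to process a subset of the file without "mapping" the whole file (a 25 TB file).

Is there a way to stop mapping after a specific number of records and carry on with the reduce part?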

Answer 1

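There is a very simple and elegant solution to this problem: override the run() method of the org.apache.hadoop.mapreduce.Mapper class and call map() only for as many records as you need or want.

See the following: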

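import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside your job class
public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();
    private int numberOfRecordsToProcess;

    @Override
    protected void setup(Context context) {
        // Read the limit that the driver class stored in the job configuration
        // after getting input from the user. The key "records.to.process" is a
        // placeholder, not a Hadoop setting; use whatever key your driver sets.
        numberOfRecordsToProcess = context.getConfiguration().getInt("records.to.process", 0);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do your map thing
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            int count = 0;
            // Stop pulling records once this task has mapped enough of them.
            // The limit applies to each map task (one per input split), not to the whole job.
            while (count < numberOfRecordsToProcess && context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
                count++;
            }
        } finally {
            // cleanup() belongs inside run(); the original called it at class level, which does not compile
            cleanup(context);
        }
    }
}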
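
One thing worth checking before relying on the run() override: the counter lives inside each mapper, so the limit is applied per input split rather than to the job as a whole. The following is a minimal plain-Java sketch of that behaviour with no Hadoop dependency; the class RecordLimitSketch and its methods are made up for illustration, and each list stands in for one input split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RecordLimitSketch {

    // Mimics the overridden run(): consume records from one split until the limit is hit.
    static List<String> runOneMapper(Iterator<String> split, int limit) {
        List<String> mapped = new ArrayList<>();
        int count = 0;
        while (count < limit && split.hasNext()) {
            mapped.add(split.next().toUpperCase()); // stand-in for map()
            count++;
        }
        return mapped;
    }

    // Runs one simulated mapper per split and returns the total number of mapped records.
    static int runJob(List<List<String>> splits, int limitPerMapper) {
        int total = 0;
        for (List<String> split : splits) {
            total += runOneMapper(split.iterator(), limitPerMapper).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("a", "b", "c", "d"),
            Arrays.asList("e", "f", "g"),
            Arrays.asList("h"));
        // With a limit of 2 per mapper, the splits contribute 2 + 2 + 1 records.
        System.out.println(runJob(splits, 2)); // prints 5
    }
}
```

If the user's number is meant as a job-wide total, the per-mapper limit would have to be derived from it (or the job restricted to a single mapper), since the mappers in this approach do not coordinate with one another.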

Answer 2

See "How to create output files with fixed number of lines in hadoop/map reduce?". You can use the information in that link to feed N lines as mapper input, and run only a single mapper from the main class:

setNumMapTasks(int)

Processing a subset of a file in MapReduce