HashPartitioner in MapReduce

Date: 2015-11-09 15:10:29

Tags: hadoop mapreduce hadoop-partitioning

Goal:

Implement a hash partitioner and check how many reducers are created automatically.

Any help and any sample code toward this goal is always appreciated.

What I did:

I ran a MapReduce program with a custom hash partitioner on a 250 MB CSV file, but I still see that Hadoop uses only one reducer for the aggregation. If I have understood correctly, Hadoop should automatically create the partitions and distribute the data evenly, and n reducers should then work on the n partitions created. But I do not see this happening. Can anyone help me achieve this with hash partitioning? I do not want to define the number of partitions myself.

Mapper code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlightMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Split on commas that are not inside double-quoted fields
        String[] line = value.toString().split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        String airlineid = line[7];
        String tailno = line[9].replace("\"", "");

        // Skip records with an empty tail number
        if (tailno.length() != 0) {
            context.write(new Text(airlineid), new Text(tailno));
        }
    }
}

Reducer code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlightReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Count the tail-number values seen for each airline ID
        int count = 0;
        for (Text value : values) {
            count++;
        }

        context.write(key, new IntWritable(count));
    }
}

Partitioner code:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlightPartition extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Mask off the sign bit so a negative hashCode() cannot
        // produce a negative partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
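This is the same logic Hadoop's built-in HashPartitioner uses: the Integer.MAX_VALUE mask clears the sign bit so that a key whose hashCode() is negative cannot yield a negative partition index. A minimal illustration (the key "19393" and the reducer count 4 are arbitrary values for this sketch):

    int h = new Text("19393").hashCode();         // may be any int, possibly negative
    int partition = (h & Integer.MAX_VALUE) % 4;  // always in the range 0..3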

Driver code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Flight {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Flight");
        job.setJarByClass(Flight.class);

        job.setMapperClass(FlightMapper.class);
        job.setReducerClass(FlightReducer.class);
        job.setPartitionerClass(FlightPartition.class);

        // The mapper emits (Text, Text); the reducer emits (Text, IntWritable)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Log:

15/11/09 06:14:14 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=7008211
        FILE: Number of bytes written=14438683
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=211682444
        HDFS: Number of bytes written=178
        HDFS: Number of read operations=12
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Killed map tasks=2
        Launched map tasks=5
        Launched reduce tasks=1
        Data-local map tasks=5
        Total time spent by all maps in occupied slots (ms)=2235296
        Total time spent by all reduces in occupied slots (ms)=606517
        Total time spent by all map tasks (ms)=2235296
        Total time spent by all reduce tasks (ms)=606517
        Total vcore-seconds taken by all map tasks=2235296
        Total vcore-seconds taken by all reduce tasks=606517
        Total megabyte-seconds taken by all map tasks=2288943104
        Total megabyte-seconds taken by all reduce tasks=621073408
    Map-Reduce Framework
        Map input records=470068
        Map output records=467281
        Map output bytes=6073643
        Map output materialized bytes=7008223
        Input split bytes=411
        Combine input records=0
        Combine output records=0
        Reduce input groups=15
        Reduce shuffle bytes=7008223
        Reduce input records=467281
        Reduce output records=15
        Spilled Records=934562
        Shuffled Maps =3
        Failed Shuffles=0
        Merged Map outputs=3
        GC time elapsed (ms)=3701
        CPU time spent (ms)=277080
        Physical memory (bytes) snapshot=590581760
        Virtual memory (bytes) snapshot=3196801024
        Total committed heap usage (bytes)=441397248
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=211682033
    File Output Format Counters 
        Bytes Written=178

1 Answer:

Answer 0 (score: 0)

Check your mapred-default.xml file and look for the mapreduce.job.reduces property. Change its value to > 1 to get more reducers on the cluster. Note that this property is ignored when mapreduce.jobtracker.address is "local".

You can override the default property in Java with:

    job.setNumReduceTasks(3);
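For example, a minimal sketch of the driver above with the reducer count set explicitly (the value 3 is arbitrary; with FlightPartition in place, keys would then be hashed across 3 partitions):

    Job job = Job.getInstance(conf, "Flight");
    job.setPartitionerClass(FlightPartition.class);
    job.setNumReduceTasks(3); // run 3 reduce tasks instead of the default 1

Alternatively, if the driver implements Tool and is launched through ToolRunner, the same property can be passed on the command line as -D mapreduce.job.reduces=3.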

See this article for the complete list of Apache's mapred-default.xml properties.

How Many Reduces? (from Apache)

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but also improves load balancing and lowers the cost of failures.
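As a sketch, the 0.95 heuristic could be applied in the driver like this (the node and container counts are hypothetical cluster values, not something the job discovers on its own):

    int nodes = 10;             // hypothetical: worker nodes in the cluster
    int containersPerNode = 8;  // hypothetical: max containers per node
    job.setNumReduceTasks((int) (0.95 * nodes * containersPerNode)); // 76 reduces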

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
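The arithmetic behind that figure: 10 TB / 128 MB = (10 * 1024 * 1024 MB) / 128 MB = 81,920 input blocks, hence roughly 82,000 map tasks, one per block.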

See the Apache MapReduce Tutorial.