Number of Hadoop mappers not changing when setting the input split size

Time: 2016-11-19 05:37:31

Tags: java hadoop mapreduce hdfs mapper

I am trying to run a Hadoop job several times with different numbers of mappers and reducers. I have set the following configuration properties:

  
      
  • mapreduce.input.fileinputformat.split.maxsize
  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.job.maps

My total input size is 1160421275 bytes. When I try to configure the job for 4 mappers and 3 reducers with this code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
// total length of both input files
long size = hdfs.getContentSummary(new Path("input/filea")).getLength();
size += hdfs.getContentSummary(new Path("input/fileb")).getLength();
// force both the min and max split size to one quarter of the total input
conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(size / 4));
conf.set("mapreduce.input.fileinputformat.split.minsize", String.valueOf(size / 4));
conf.set("mapreduce.job.maps", "4");
....
job.setNumReduceTasks(3);

size / 4 gives 290105318. Running the job produces the following output:

2016-11-19 12:30:36,426 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1
2016-11-19 12:30:36,535 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 4
2016-11-19 12:30:36,572 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:7

The number of splits is 7, not 4. The counters of the successful job are:

File System Counters
    FILE: Number of bytes read=18855390277
    FILE: Number of bytes written=14653469965
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=39184416
    Map output records=36751473
    Map output bytes=787022241
    Map output materialized bytes=860525313
    Input split bytes=1801
    Combine input records=0
    Combine output records=0
    Reduce input groups=25064998
    Reduce shuffle bytes=860525313
    Reduce input records=36751473
    Reduce output records=1953960
    Spilled Records=110254419
    Shuffled Maps =21
    Failed Shuffles=0
    Merged Map outputs=21
    GC time elapsed (ms)=1124
    CPU time spent (ms)=0
    Physical memory (bytes) snapshot=0
    Virtual memory (bytes) snapshot=0
    Total committed heap usage (bytes)=6126829568
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters 
    Bytes Read=0
File Output Format Counters 
    Bytes Written=77643084

The counters show 21 shuffled map outputs, but I expected the job to use only 4 mappers. The reducers produce the correct number of output files, 3 in total. Is my mapper split size configuration wrong?

1 Answer:

Answer 0: (score: 0)

I believe you are using TextInputFormat.
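For reference, here is a minimal sketch of how FileInputFormat (which TextInputFormat extends) derives the split size per file in Hadoop 2.x. The variables fileLength, blockSize, minSize and maxSize are placeholders standing in for the values the framework reads for each input file; treat this as an approximation of its logic, not the exact code:

// Rough sketch of FileInputFormat's per-file split sizing (Hadoop 2.x).
// Splits never cross file boundaries, so each file is sized independently.
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
// Approximate number of splits produced for one file of length fileLength
// (the real code also tolerates a small ~10% overshoot on the last split).
long splitsForFile = (long) Math.ceil((double) fileLength / splitSize);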

  1. If you have multiple files, each file produces at least one mapper. If an individual file (not the cumulative input) is larger than the block size (as adjusted by your min and max split settings), more mappers are generated for it.

  2. Try CombineTextInputFormat; it can get you closer to what you want, though the result may still not be exactly 4 (see the sketch after this list).

  3. Look at how the InputFormat you are using computes its splits to understand how many mappers will be generated.
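As a minimal sketch of point 2, assuming Hadoop 2.x and the job and size variables from the question: switching to CombineTextInputFormat lets several files (or file remainders) be packed into one split, so the mapper count lands closer to the target, though it is still not guaranteed to be exactly 4.

import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Pack the input files into combined splits of at most size / 4 bytes each,
// instead of generating at least one split per file.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, size / 4);

With combined splits the size / 4 cap applies across files rather than per file, which is why it tends to come closer to the requested number of mappers.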