Question

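I give my Hadoop program a 4 MB input file (100k records). Since each HDFS block is 64 MB and the file fits in a single block, I chose 1 as the number of mappers. However, when I increase the number of mappers (say, to 24), the running time improves. I don't understand why, since the whole file can only be read by one mapper.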

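A brief description of the algorithm: the configure function reads the clusters from the DistributedCache and stores them in a global variable called clusters. The mapper reads its chunk line by line and finds the cluster each line belongs to. Here is some of the code: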

public void configure(JobConf job) {
        // Retrieve the clusters file from the DistributedCache
        try {
            Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));

            String line;
            while ((line = reader.readLine()) != null) {
                // construct the cluster represented by `line` and add it to a global variable called `clusters`
            }

            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

And the mapper:

    public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
        // Assign each record to one of the existing clusters in `clusters`.

        String record = value.toString();
        EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
        outputValue.addRecord(record);
        int eqID = MondrianTree.findCluster(record, clusters);
        IntWritable outputKey = new IntWritable(eqID);
        output.collect(outputKey, outputValue);
    }

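I run it on input files of different sizes (from 4 MB to 4 GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.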

Answer 1

"since the whole file can only be read by one mapper."

That is not actually the case. A few points to keep in mind...

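That single block is replicated 3 times (by default), which means three separate nodes can access the same block without going over the network.
There is no reason a single block cannot be copied to multiple machines, each of which then seeks to the split it was assigned.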
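
To make the point about splits concrete, here is a minimal sketch of the split arithmetic, assuming the job uses the old `org.apache.hadoop.mapred.FileInputFormat` (which the `JobConf`-based code in the question suggests). As far as I recall, that class treats the requested map count as a hint and sizes splits as max(minSize, min(totalSize / requestedMaps, blockSize)), with 10% slack on the last split. The class and method names below (`OldApiSplitSketch`, `countSplits`) are made up for illustration and are not Hadoop API; the sketch only covers a single non-empty, splittable file.

```java
public class OldApiSplitSketch {
    // Slack factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size formula as recalled from the old-API FileInputFormat (assumption).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Hypothetical helper: how many splits (and so map tasks) one file would get.
    static int countSplits(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        int splits = 0;
        long bytesRemaining = fileSize;
        // Carve off full-size splits while more than 1.1 splits' worth of data remains.
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left becomes the last split.
        if (bytesRemaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        // 4 MB file, 1 requested map: the whole file is one split.
        System.out.println(countSplits(4 * mb, 1, 1, blockSize));
        // 4 MB file, 24 requested maps: the single block is cut into many small splits.
        System.out.println(countSplits(4 * mb, 24, 1, blockSize));
        // 4 GB file, 1 requested map: the block size caps the split size.
        System.out.println(countSplits(4096 * mb, 1, 1, blockSize));
    }
}
```

Under these assumptions, asking for 24 maps on the 4 MB file yields 24 splits of roughly 170 KB each, so 24 tasks each process about 1/24 of the records in parallel. That would explain the speed-up if the per-record work in `findCluster` dominates the cost of reading the file.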

Answer 2

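You need to tune "mapred.max.split.size". Set it to an appropriate size in bytes. The MR framework will then compute the number of mappers from this value and the block size.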
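
As a rough illustration of picking that value, the sketch below derives a maximum split size from a target mapper count. It assumes the new-API `FileInputFormat` rule, which as far as I know is splitSize = max(minSize, min(maxSize, blockSize)); I believe the old `mapred` API used in the question does not read this property and relies on the requested map count instead, so treat this as a sketch under that assumption. `MaxSplitSizeSketch` and `maxSplitSizeFor` are hypothetical names, not Hadoop API.

```java
public class MaxSplitSizeSketch {
    // Hypothetical helper: smallest max split size that needs at most targetMappers splits.
    static long maxSplitSizeFor(long fileSize, int targetMappers) {
        return (fileSize + targetMappers - 1) / targetMappers; // ceiling division
    }

    // Split size rule as recalled from the new-API FileInputFormat (assumption).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        long maxSize = maxSplitSizeFor(4 * mb, 24);
        // The value one would assign to the property, in bytes.
        System.out.println("mapred.max.split.size=" + maxSize);
        // The effective split size: the smaller of the max split size and the block size.
        System.out.println("splitSize=" + computeSplitSize(blockSize, 1, maxSize));
    }
}
```

Note that a max split size larger than the block size has no effect under this rule, since the block size caps the split.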

How do I determine the right number of mappers in Hadoop?
