Number of map tasks equal to the number of files

Time: 2015-06-25 00:45:39

Tags: hadoop

I am new to Hadoop and am trying to run the word count example. The number of map tasks does not match the number of input files: I passed a directory containing 10 files to the Hadoop wordcount example, but more than 10 map tasks were created. Can we limit the number of map tasks to equal the number of files, so that each map task takes exactly one file as input?

I am using Hadoop version 1.

2 answers:

Answer 0: (score: 2)

Note that if the files are large, a single map per file will become a bottleneck. That said, create a new InputFormat that disallows splitting and start using it. Here is the code for it:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * A TextInputFormat that never splits a file, so each input file
 * produces exactly one split and therefore exactly one map task.
 */
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false forces the whole file into a single split.
        return false;
    }

}
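
To use it, register the format on the job in the driver. A minimal sketch assuming the new mapreduce API used above; WordCountMapper and WordCountReducer are hypothetical placeholders for the standard word count classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);

        // One split per file: the non-splittable format guarantees
        // exactly one map task for each input file.
        job.setInputFormatClass(NonSplittableTextInputFormat.class);

        job.setMapperClass(WordCountMapper.class);   // placeholder mapper
        job.setReducerClass(WordCountReducer.class); // placeholder reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}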

Answer 1: (score: 2)

You will get a mapper for each split. You can bypass this in a few ways.

The first is to set mapred.min.split.size large enough that none of the files meets the split criteria (a sketch follows below).

Another option is to implement your own InputFormat, as Praveen suggests. A few have already been created, though I don't know their state with current versions of Hadoop: https://gist.github.com/sritchie/808035 and http://lordjoesoftware.blogspot.com/2010/08/customized-splitters-and-readers.html are two, though they are old.

Another simple option is to store your files in a format that is not splittable. GZip comes to mind, although it creates a little overhead from decompressing the files. There is more overhead if the gzipped file is larger than the block size, because the blocks will be placed on different nodes and have to be combined BEFORE the file can be put through the map task.
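
A hedged sketch of the first option, raising the minimum split size in the driver so no file meets the split criteria. The 10 GB value is arbitrary and only needs to exceed your largest input file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MinSplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the minimum split size above the largest input file so
        // each file yields exactly one split, and hence one map task.
        // 10 GB is an illustrative value, not a recommendation.
        conf.setLong("mapred.min.split.size", 10L * 1024 * 1024 * 1024);
        Job job = new Job(conf, "wordcount");
        // ... configure mapper, reducer, and input/output paths as usual
    }
}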