If the files are large, a single map task per file can become a bottleneck. Create a new InputFormat like the one below and start using it. Here is the code for it.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split the input, so each whole file goes to a single mapper.
        return false;
    }
}
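Then point your job at the new format in the driver. A minimal sketch, assuming the usual new-API Job setup (the class and job names here are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "one-mapper-per-file"); // placeholder job name
        // Use the non-splittable format so every input file is read by exactly one mapper.
        job.setInputFormatClass(NonSplittableTextInputFormat.class);
        // ... set mapper/reducer classes, input/output paths, then submit as usual.
    }
}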
You will get a mapper for each split. You can bypass this in a few ways, the first being to set mapred.min.split.size to a large value so that none of your files is big enough to be split (see the sketch below).
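For example, something along these lines in the job driver; the exact property name depends on your Hadoop version, and the Long.MAX_VALUE value is just an illustration, not a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class NoSplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "no-splits"); // placeholder job name
        // Either set the older property name directly...
        job.getConfiguration().setLong("mapred.min.split.size", Long.MAX_VALUE);
        // ...or use the FileInputFormat helper, which sets the newer property name.
        FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);
        // ... rest of the job setup as usual.
    }
}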
Another option is to implement your own InputFormat as Praveen suggests. There are a few already created, though I don't know their state with current versions of Hadoop. https://gist.github.com/sritchie/808035 and http://lordjoesoftware.blogspot.com/2010/08/customized-splitters-and-readers.html are a few though they are old.
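If you do end up rolling your own, the usual pattern in those examples is a "whole file" format: mark the file as non-splittable and have the record reader emit the entire file as a single record. A rough sketch along those lines (the class names and the BytesWritable value type are my own choices, not taken from those links, and it assumes each file fits in a mapper's memory):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so one mapper sees the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Emits exactly one record per file: the file's entire contents as the value.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the whole file into memory; assumes the file is small enough for that.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* nothing held open between records */ }
    }
}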
Another simple option would be to store your files in a format that is not splittable. GZip comes to mind; however, it does create a little overhead because the files have to be decompressed. There is more overhead if the gzipped file is larger than the block size, because its blocks will be placed on different nodes and have to be pulled together BEFORE it can be put through the map task.
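If you go that route, the files just need to land in HDFS as .gz; since gzip is not a splittable codec, each file will go to a single mapper. A hedged sketch of compressing while copying in (the class name and path arguments are made up for the example):

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Copies a local file into HDFS as a .gz file; the resulting file is not
// splittable, so the whole file will be handled by one map task.
public class GzipUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);

        Path src = new Path(args[0]);         // local input file (example argument)
        Path dst = new Path(args[1] + ".gz"); // HDFS target path (example argument)

        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        try (InputStream in = local.open(src);
             OutputStream out = codec.createOutputStream(hdfs.create(dst))) {
            IOUtils.copyBytes(in, out, conf);
        }
    }
}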