My Hadoop cluster's HDFS block size is 64 MB.

I have 4 small input files in HDFS, located in the directory /user/cloudera/inputfiles:

words1.txt is 1 MB in size
words2.txt is 1 MB in size
words3.txt is 1 MB in size
words4.txt is 1 MB in size
I created a simple MapReduce word count program that reads all 4 of the above files from the directory /user/cloudera/inputfiles.

I can see that 4 map tasks are executed, because each of the 4 input files is smaller than the HDFS block size. This matches the split arithmetic sketched below.
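From what I have read, FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)) and never lets a single split span two files. The sketch below is my understanding of that rule, not code from my job (SplitSizeSketch is a made-up name, not a Hadoop class):

public class SplitSizeSketch {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, my cluster's HDFS block size
        long minSize = 1L;                  // default split.minsize
        long maxSize = Long.MAX_VALUE;      // default split.maxsize

        // splitSize = max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println("split size = " + splitSize); // 67108864 bytes (64 MB)

        // Each 1 MB file is smaller than the split size, and a split never
        // spans files, so 4 files produce 4 splits and therefore 4 map tasks.
    }
}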
Now I want to run this MapReduce program with only one map task. So I applied the following settings (134217728 bytes = 128 MB, i.e. twice the 64 MB block size):
conf.set(" mapreduce.input.fileinputformat.split.minsize"," 134217728");
conf.set(" mapreduce.input.fileinputformat.split.maxsize"," 134217728");
Driver code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountMain {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        // Remove the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(args[1]), true);

        // Force a 128 MB split size (twice the 64 MB block size)
        conf.set("mapreduce.input.fileinputformat.split.minsize", "134217728");
        conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728");
        // Old mapred.* names I also tried:
        //conf.set("mapred.min.split.size", "134217728");
        //conf.set("mapred.max.split.size", "134217728");

        Job job = new Job(conf);
        job.setJarByClass(WordCountMain.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setNumReduceTasks(0); // map-only job, so the reducer never actually runs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
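For reference, WordCountMapper and WordCountReducer are the usual word count classes. A minimal sketch of what mine look like (the exact bodies are not important to the question):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Sums the counts per word. With setNumReduceTasks(0) this never runs
// and the raw (word, 1) pairs are written directly to the output.
class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}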
But when I run the job, I still see 4 map tasks being executed. I would like to know why my MapReduce job still uses 4 map tasks when I have applied the settings that should make it run with a single one.

I run the MapReduce program with the following command:
hadoop jar /home/myjars/WordCountMain.jar /user/cloudera/inputfiles/ /user/cloudera/outputfiles
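In case it is useful, this is how I confirm the number of launched map tasks from the driver (a sketch; the helper class name is mine, but JobCounter.TOTAL_LAUNCHED_MAPS is the standard counter):

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class MapTaskCountCheck {
    // Call after job.waitForCompletion(true); reads the launched-maps
    // counter instead of relying on the web UI.
    static void printLaunchedMaps(Job job) throws Exception {
        Counter maps = job.getCounters().findCounter(JobCounter.TOTAL_LAUNCHED_MAPS);
        System.out.println("launched map tasks = " + maps.getValue()); // shows 4 for my runs
    }
}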