Goal:
Implement a HashPartitioner and check the number of reducers that are created automatically.
Any help and any sample code toward this goal is always appreciated.
What I did:
I ran a MapReduce program with a hash partitioner implemented on a 250 MB CSV file, but I still see that HDFS uses only 1 reducer for the aggregation. If I have understood this correctly, HDFS should create the partitions automatically and distribute the data evenly; n reducers would then work on the n partitions created. But I do not see that happening. Can anyone help me achieve this with hash partitioning? I do not want to define the number of partitions myself.
Mapper code:
public class FlightMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split on commas that are not inside quoted fields.
        String[] line = value.toString().split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        String airlineid = line[7];
        //int tailno = Integer.parseInt(line[10].replace("\"", ""));
        String tailno = line[9].replace("\"", "");
        // Skip records with an empty tail number.
        if (tailno.length() != 0) {
            //System.out.println(airlineid + " " + tailno + " " + tailno.length());
            context.write(new Text(airlineid), new Text(tailno));
        }
    }
}
Reducer code:
public class FlightReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Count the tail numbers seen for each airline id.
        int count = 0;
        for (Text value : values) {
            count++;
        }
        //context.write(key, new IntWritable(maxValue));
        context.write(key, new IntWritable(count));
    }
}
Partitioner code:
public class FlightPartition extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Same formula as Hadoop's built-in HashPartitioner: mask the sign
        // bit, then take the hash modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
Driver:
public class Flight
{
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Flight");
        job.setJarByClass(Flight.class);
        job.setMapperClass(FlightMapper.class);
        job.setReducerClass(FlightReducer.class);
        job.setPartitionerClass(FlightPartition.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Log:
15/11/09 06:14:14 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=7008211
FILE: Number of bytes written=14438683
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=211682444
HDFS: Number of bytes written=178
HDFS: Number of read operations=12
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Killed map tasks=2
Launched map tasks=5
Launched reduce tasks=1
Data-local map tasks=5
Total time spent by all maps in occupied slots (ms)=2235296
Total time spent by all reduces in occupied slots (ms)=606517
Total time spent by all map tasks (ms)=2235296
Total time spent by all reduce tasks (ms)=606517
Total vcore-seconds taken by all map tasks=2235296
Total vcore-seconds taken by all reduce tasks=606517
Total megabyte-seconds taken by all map tasks=2288943104
Total megabyte-seconds taken by all reduce tasks=621073408
Map-Reduce Framework
Map input records=470068
Map output records=467281
Map output bytes=6073643
Map output materialized bytes=7008223
Input split bytes=411
Combine input records=0
Combine output records=0
Reduce input groups=15
Reduce shuffle bytes=7008223
Reduce input records=467281
Reduce output records=15
Spilled Records=934562
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=3701
CPU time spent (ms)=277080
Physical memory (bytes) snapshot=590581760
Virtual memory (bytes) snapshot=3196801024
Total committed heap usage (bytes)=441397248
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=211682033
File Output Format Counters
Bytes Written=178
Answer 0 (score: 0):
Check your mapred-default.xml file and look for the mapreduce.job.reduces property. Change the value to > 1 to get more reducers in the cluster. Note that this property is ignored when mapreduce.jobtracker.address is "local".
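The same property can also be set per job on the Configuration object, without editing the cluster XML files. A minimal sketch (the value 3 is arbitrary):

Configuration conf = new Configuration();
// Equivalent to setting mapreduce.job.reduces in the XML configuration,
// but scoped to this one job.
conf.setInt("mapreduce.job.reduces", 3);
Job job = Job.getInstance(conf, "Flight");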
You can also override the default property in Java with job.setNumReduceTasks(3), as sketched below.
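For example, your driver above needs only one extra call (a sketch; 3 reducers chosen arbitrarily, tune the value to your cluster):

public class Flight
{
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Flight");
        job.setJarByClass(Flight.class);
        job.setMapperClass(FlightMapper.class);
        job.setReducerClass(FlightReducer.class);
        job.setPartitionerClass(FlightPartition.class);
        // Without this call the job falls back to the default of a single
        // reducer, so FlightPartition.getPartition() always returns 0.
        job.setNumReduceTasks(3);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}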
Have a look at this article for the complete list of mapred-default.xml properties from Apache.
How many reduces? (from Apache)
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
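As a worked example of this heuristic, for a hypothetical cluster of 10 nodes with 8 containers per node:

// Hypothetical cluster: 10 nodes, 8 maximum containers per node.
int nodes = 10;
int maxContainersPerNode = 8;
// 0.95 * (10 * 8) = 76 reduces: all start at once in a single wave.
int oneWave = (int) (0.95 * nodes * maxContainersPerNode);
// 1.75 * (10 * 8) = 140 reduces: faster nodes pick up a second wave.
int twoWaves = (int) (1.75 * nodes * maxContainersPerNode);
job.setNumReduceTasks(oneWave); // job as in the driver above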
How many maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
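For reference, the arithmetic behind that figure: 10 TB / 128 MB = (10 × 1024 × 1024 MB) / 128 MB = 81,920 blocks, i.e. roughly 82,000 map tasks.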