Huge overhead in a simple MapReduce job

Asked: 2011-08-26 10:30:32

Tags: java hadoop mapreduce

I am experimenting with Hadoop and have created a very simple map and reduce job. The input is a 30-line text file and the output is only 3 lines (it is an extract from a log file, where the map extracts the page name and execution time, and the reduce computes the min, max and avg execution times).
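The question does not include the job's source code. As a rough sketch of the logic being described, written outside the Hadoop API (the log format "pageName executionTimeMillis" is an assumption, not from the question):

```java
import java.util.*;

public class ExecStats {
    // Hypothetical log format: "pageName executionTimeMillis".
    // Aggregates min, max and average execution time per page, mirroring
    // what the described map (extract) and reduce (aggregate) steps do.
    public static Map<String, double[]> aggregate(List<String> lines) {
        Map<String, List<Long>> byPage = new HashMap<>();
        for (String line : lines) {               // "map": extract key and value
            String[] parts = line.trim().split("\\s+");
            byPage.computeIfAbsent(parts[0], k -> new ArrayList<>())
                  .add(Long.parseLong(parts[1]));
        }
        Map<String, double[]> stats = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : byPage.entrySet()) {  // "reduce"
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
            for (long t : e.getValue()) {
                min = Math.min(min, t);
                max = Math.max(max, t);
                sum += t;
            }
            stats.put(e.getKey(),
                      new double[] { min, max, (double) sum / e.getValue().size() });
        }
        return stats;
    }
}
```

On 30 input lines this logic itself runs in microseconds, which is the point of the question: the elapsed time is framework overhead, not computation.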

This simple job takes some 36 seconds to execute on Hadoop in pseudo-distributed mode (fs.default.name=hdfs://localhost, dfs.replication=1, mapred.job.tracker=localhost:8021). The machine is a 2.93 GHz Nehalem with 8 GB RAM and an X25-E SSD, running Ubuntu 10.04.
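For reference, those pseudo-distributed settings correspond roughly to the following (Hadoop 0.20-era property names; the split across the three site files is the conventional layout and is assumed here):

```xml
<!-- core-site.xml -->
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost</value></property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:8021</value></property>
</configuration>
```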

I added debug output on each invocation in the mapper and the reducer, and it clearly shows that every call to the mapper and the reducer happens within the same second. In other words, the overhead is before and after Hadoop actually invokes my code.

Here is the final output:

18:16:59 INFO input.FileInputFormat: Total input paths to process : 1
18:16:59 INFO mapred.JobClient: Running job: job_201108251550_0023
18:17:00 INFO mapred.JobClient:  map 0% reduce 0%
18:17:15 INFO mapred.JobClient:  map 100% reduce 0%
18:17:30 INFO mapred.JobClient:  map 100% reduce 100%
18:17:35 INFO mapred.JobClient: Job complete: job_201108251550_0023
18:17:35 INFO mapred.JobClient: Counters: 25
18:17:35 INFO mapred.JobClient:   Job Counters 
18:17:35 INFO mapred.JobClient:     Launched reduce tasks=1
18:17:35 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11305
18:17:35 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient:     Launched map tasks=1
18:17:35 INFO mapred.JobClient:     Data-local map tasks=1
18:17:35 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=12762
18:17:35 INFO mapred.JobClient:   File Output Format Counters 
18:17:35 INFO mapred.JobClient:     Bytes Written=140
18:17:35 INFO mapred.JobClient:   FileSystemCounters
18:17:35 INFO mapred.JobClient:     FILE_BYTES_READ=142
18:17:35 INFO mapred.JobClient:     HDFS_BYTES_READ=7979
18:17:35 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=41899
18:17:35 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=140
18:17:35 INFO mapred.JobClient:   File Input Format Counters 
18:17:35 INFO mapred.JobClient:     Bytes Read=7877
18:17:35 INFO mapred.JobClient:   Map-Reduce Framework
18:17:35 INFO mapred.JobClient:     Reduce input groups=3
18:17:35 INFO mapred.JobClient:     Map output materialized bytes=142
18:17:35 INFO mapred.JobClient:     Combine output records=0
18:17:35 INFO mapred.JobClient:     Map input records=30
18:17:35 INFO mapred.JobClient:     Reduce shuffle bytes=0
18:17:35 INFO mapred.JobClient:     Reduce output records=3
18:17:35 INFO mapred.JobClient:     Spilled Records=12
18:17:35 INFO mapred.JobClient:     Map output bytes=124
18:17:35 INFO mapred.JobClient:     Combine input records=0
18:17:35 INFO mapred.JobClient:     Map output records=6
18:17:35 INFO mapred.JobClient:     SPLIT_RAW_BYTES=102
18:17:35 INFO mapred.JobClient:     Reduce input records=6

As can be seen, it takes 16 seconds until the map is finished, then another 15 seconds until the reduce is finished, and then still 6 more seconds until the job is marked complete.

Here is the syslog of just the reduce task:

18:17:14,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
18:17:15,007 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
18:17:15,088 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130514944, MaxSingleShuffleLimit=32628736
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging on-disk files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging in memory files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread waiting: Thread for merging on-disk files
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Need another 1 map output(s) where 0 is already in progress
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for polling Map Completion Events
18:17:15,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
18:17:20,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
18:17:21,103 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 1 files left.
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,113 INFO org.apache.hadoop.mapred.ReduceTask: Merged 1 segments, 138 bytes to disk to satisfy reduce memory limit
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 142 bytes from disk
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
18:17:21,114 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,117 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,163 INFO org.apache.hadoop.mapred.Task: Task:attempt_201108251550_0023_r_000000_0 is done. And is in the process of commiting
18:17:24,170 INFO org.apache.hadoop.mapred.Task: Task attempt_201108251550_0023_r_000000_0 is allowed to commit now
18:17:24,180 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201108251550_0023_r_000000_0' to output.txt
18:17:27,088 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201108251550_0023_r_000000_0' done.
18:17:27,091 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName akira for UID 1000 from the native implementation

My actual reducer code is invoked at 18:17:21.

What gives? Is this normal?

4 answers:

Answer 0 (score: 7)

If your file were 300 million lines, then 30-odd seconds of overhead would be negligible.

Answer 1 (score: 2)

This is all perfectly normal. But why do you care?

I hope you are prototyping for a much, much bigger data set; if not, Hadoop is definitely not the solution for you.

Hadoop is designed to run on hundreds of nodes, not one, and to expect all sorts of failure modes. A system of that size needs to be very loosely coupled to be resilient; this includes things like the tasktrackers polling for work. None of that is optimized for fast start-up/tear-down of task attempts.

I suggest you draw yourself a picture of everything from job submission to job completion, showing which daemons are involved and which machines they could be running on.

可伸缩性!=性能

Answer 2 (score: 1)

The way I look at Hadoop is that its primary design goal is to scale horizontally. That design goal means Hadoop is built to handle huge volumes efficiently. If you have a very small problem (one that fits in the RAM of a single system and doesn't take long to process), a normal application will always outperform Hadoop. In that case, the overhead needed for scaling is pure overhead.

Answer 3 (score: 1)

Hadoop scales starting from about 5 nodes (namenode + secondary namenode + 3 datanodes), which gives RAID5-like safety (3 replicas). On a single node you can test your code, but not much more. After testing, when you move to a "real" cluster, you have to know the size of your data set (in blocks), because the namenode's RAM is an important parameter (every block takes up RAM; more data means more RAM).
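As a back-of-envelope illustration of that namenode RAM point (the figures below are assumptions, not from the answer: a commonly cited rule of thumb of roughly 150 bytes of namenode heap per block object, and the 64 MB default block size of Hadoop 0.20/1.x):

```java
public class NamenodeRam {
    public static void main(String[] args) {
        long dataBytes = 100L * 1024 * 1024 * 1024 * 1024; // 100 TB of data (assumed example size)
        long blockSize = 64L * 1024 * 1024;                // 64 MB default block size
        long bytesPerBlock = 150;                          // rough heap cost per block object (rule of thumb)
        long blocks = dataBytes / blockSize;
        long heapMb = (blocks * bytesPerBlock) / (1024 * 1024);
        // 100 TB in 64 MB blocks -> 1,638,400 blocks -> roughly 234 MB of namenode heap
        System.out.println(blocks + " blocks -> ~" + heapMb + " MB of namenode heap");
    }
}
```

The per-block cost matters more than the raw data size: the same 100 TB stored as many small files would need far more blocks, and therefore far more namenode heap.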