How do I find the intermediate output size and the reduce output size (in bytes) from the Hadoop logs?

Time: 2015-06-18 06:25:34

Tags: hadoop

From the Hadoop logs, how can I estimate the total intermediate output of the mappers (in bytes) and the total output of the reducers (in bytes)?

My mappers and reducers use LZO compression, and I want to know the size of the mapper/reducer output after compression.

15/06/06 17:19:15 INFO mapred.JobClient:  map 100% reduce 94%
15/06/06 17:19:16 INFO mapred.JobClient:  map 100% reduce 98%
15/06/06 17:19:17 INFO mapred.JobClient:  map 100% reduce 99%
15/06/06 17:20:04 INFO mapred.JobClient:  map 100% reduce 100%
15/06/06 17:20:05 INFO mapred.JobClient: Job complete: job_201506061602_0026
15/06/06 17:20:05 INFO mapred.JobClient: Counters: 30
15/06/06 17:20:05 INFO mapred.JobClient:   Job Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Launched reduce tasks=401
15/06/06 17:20:05 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=1203745
15/06/06 17:20:05 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient:     Rack-local map tasks=50
15/06/06 17:20:05 INFO mapred.JobClient:     Launched map tasks=400
15/06/06 17:20:05 INFO mapred.JobClient:     Data-local map tasks=350
15/06/06 17:20:05 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6642599
15/06/06 17:20:05 INFO mapred.JobClient:   File Output Format Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Bytes Written=534808008
15/06/06 17:20:05 INFO mapred.JobClient:   FileSystemCounters
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=534808008
15/06/06 17:20:05 INFO mapred.JobClient:   File Input Format Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Bytes Read=167978609
15/06/06 17:20:05 INFO mapred.JobClient:   Map-Reduce Framework
15/06/06 17:20:05 INFO mapred.JobClient:     Map output materialized bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient:     Map input records=3774768
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce shuffle bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient:     Spilled Records=56007636
15/06/06 17:20:05 INFO mapred.JobClient:     Map output bytes=336045816
15/06/06 17:20:05 INFO mapred.JobClient:     Total committed heap usage (bytes)=592599187456
15/06/06 17:20:05 INFO mapred.JobClient:     CPU time spent (ms)=9204120
15/06/06 17:20:05 INFO mapred.JobClient:     Combine input records=0
15/06/06 17:20:05 INFO mapred.JobClient:     SPLIT_RAW_BYTES=52000
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce input records=28003818
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce input groups=11478107
15/06/06 17:20:05 INFO mapred.JobClient:     Combine output records=0
15/06/06 17:20:05 INFO mapred.JobClient:     Physical memory (bytes) snapshot=516784615424
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce output records=94351104
15/06/06 17:20:05 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1911619866624
15/06/06 17:20:05 INFO mapred.JobClient:     Map output records=28003818
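
For reference, here is a minimal sketch that pulls the counter values out of client output like the above. It assumes the console output has been saved to a local file, hypothetically named job_output.log; the counter names are exactly the labels printed by mapred.JobClient.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobLogCounters {
    // Matches counter lines such as
    // "15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=651797418"
    private static final Pattern COUNTER_LINE =
            Pattern.compile("mapred\\.JobClient:\\s+(.+?)=(\\d+)\\s*$");

    public static void main(String[] args) throws IOException {
        Map<String, Long> counters = new HashMap<>();
        // "job_output.log" is a hypothetical file holding the client output shown above.
        for (String line : Files.readAllLines(Paths.get("job_output.log"))) {
            Matcher m = COUNTER_LINE.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        // Print the counters most relevant to intermediate and final output sizes.
        System.out.println("FILE_BYTES_WRITTEN = " + counters.get("FILE_BYTES_WRITTEN"));
        System.out.println("HDFS_BYTES_WRITTEN = " + counters.get("HDFS_BYTES_WRITTEN"));
        System.out.println("Map output materialized bytes = "
                + counters.get("Map output materialized bytes"));
    }
}
```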

1 answer:

Answer 0 (score: 2)

You can get this information from the FileSystemCounters. The terms used in this counter group are explained below:

FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all map input data comes from HDFS, FILE_BYTES_READ should be zero during the map phase. The input to the reducers, on the other hand, is data on the reduce-side local disks that was fetched from the map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by the reducers.

FILE_BYTES_WRITTEN consists of two parts. The first part comes from the mappers: all mappers spill intermediate output to disk, and every byte a mapper writes to disk is included in FILE_BYTES_WRITTEN. The second part comes from the reducers: during the shuffle phase, the reducers fetch intermediate data from the mappers, then merge and spill it to the reduce-side disks. Every byte a reducer writes to disk is also included in FILE_BYTES_WRITTEN.

HDFS_BYTES_READ denotes the bytes the mappers read from HDFS when the job starts. This data includes not only the contents of the source files but also metadata about the splits.

HDFS_BYTES_WRITTEN denotes the bytes written to HDFS. It is the number of bytes of the final output.
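
Besides reading the client log, these counters can also be fetched programmatically once the job has finished. The following is a minimal sketch against the old mapred API (matching the mapred.JobClient output in the question); the job ID is taken from that log, and the group/counter names match the labels shown there, although internal group names can differ between Hadoop versions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class FetchJobCounters {
    public static void main(String[] args) throws IOException {
        // Connect to the cluster that ran the job (old "mapred" API).
        JobClient client = new JobClient(new JobConf(new Configuration()));
        RunningJob job = client.getJob(JobID.forName("job_201506061602_0026"));

        Counters counters = job.getCounters();

        // FILE_BYTES_WRITTEN: local-disk bytes from map-side spills plus
        // reduce-side merge spills (already compressed if map output
        // compression, e.g. LZO, is enabled).
        long fileBytesWritten = counters.getGroup("FileSystemCounters")
                                        .getCounter("FILE_BYTES_WRITTEN");

        // HDFS_BYTES_WRITTEN: bytes of the final (reducer) output on HDFS.
        long hdfsBytesWritten = counters.getGroup("FileSystemCounters")
                                        .getCounter("HDFS_BYTES_WRITTEN");

        System.out.println("FILE_BYTES_WRITTEN = " + fileBytesWritten);
        System.out.println("HDFS_BYTES_WRITTEN = " + hdfsBytesWritten);
    }
}
```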