According to the Hadoop manual, map tasks should be launched on the node where the input data is stored in HDFS, provided slots are available there. Unfortunately, I found that this is not the case when using the Hadoop Streaming library: map tasks are started on entirely different nodes than the ones physically holding the input file (given by the -input flag), even though no other work is running on the cluster. I have tested this on multiple systems with Hadoop 2.6.0 and 2.7.2. Is there a way to influence this behavior? The input files are large, so this unnecessary network traffic significantly degrades overall performance.
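For reference, a streaming invocation of the kind described here might look like the following; the jar path, the HDFS input/output paths, and the mapper/reducer commands are placeholders, not the exact job from my setup:

    # Hypothetical Hadoop Streaming job; all paths are placeholders.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/large-input.txt \
        -output /data/wordcount-out \
        -mapper /bin/cat \
        -reducer /usr/bin/wc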
*Update*
Following a suggestion in the comments, I tested the issue with a "plain" Hadoop job, specifically the classic WordCount example. The result was the same: in only 1 out of 10 runs was the map task scheduled on a node that already held the data. I verified the block locations and the chosen execution nodes through the web interfaces (NameNode and YARN web UI) and with command-line tools (hdfs fsck ... and reading YARN's log files). I had just restarted the cluster and verified that no other interfering jobs were running.
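For anyone who wants to reproduce the check, the block locations of the input file can be listed like this (the path is a placeholder):

    # List which DataNodes hold the blocks of the input file,
    # to compare against the node chosen for the map task.
    hdfs fsck /data/large-input.txt -files -blocks -locations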
I also noticed that the Data-local map tasks counter does not even appear in the job's summary output; I only get Rack-local map tasks. Of course it is rack-local, since there is only one rack in this test environment. Is there a configuration option I am missing? (One candidate scheduler setting is sketched after the counter listing below.)
File System Counters
    FILE: Number of bytes read=3058
    FILE: Number of bytes written=217811
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=3007
    HDFS: Number of bytes written=2136
    HDFS: Number of read operations=6
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=12652
    Total time spent by all reduces in occupied slots (ms)=13368
    Total time spent by all map tasks (ms)=3163
    Total time spent by all reduce tasks (ms)=13368
    Total vcore-seconds taken by all map tasks=3163
    Total vcore-seconds taken by all reduce tasks=13368
    Total megabyte-seconds taken by all map tasks=12955648
    Total megabyte-seconds taken by all reduce tasks=13688832
Map-Reduce Framework
    Map input records=9
    Map output records=426
    Map output bytes=4602
    Map output materialized bytes=3058
    Input split bytes=105
    Combine input records=426
    Combine output records=229
    Reduce input groups=229
    Reduce shuffle bytes=3058
    Reduce input records=229
    Reduce output records=229
    Spilled Records=458
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=75
    CPU time spent (ms)=1480
    Physical memory (bytes) snapshot=440565760
    Virtual memory (bytes) snapshot=4798459904
    Total committed heap usage (bytes)=310902784
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=2902
File Output Format Counters
    Bytes Written=2136
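One setting I am now looking at, under the assumption that it is relevant here, is the CapacityScheduler's locality delay, which controls how many scheduling opportunities are skipped while waiting for a node-local container:

    # Check the CapacityScheduler's locality delay; if I read the docs
    # right, it defaults to 40 missed scheduling opportunities in recent
    # 2.x releases before falling back to rack-local (assumption: this
    # is the relevant knob).
    grep -A1 node-locality-delay $HADOOP_CONF_DIR/capacity-scheduler.xml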
*Update 2*
I realized that the only reason the Data-local map tasks counter was missing from the summary is that it was 0, so it gets omitted. I reconfigured the cluster via the net.topology.script.file.name parameter so that each node forms its own rack (see the bash example in the Hadoop Manual; a sketch of such a script follows the counters below). Now Hadoop proudly reports that the executed map task is not even rack-local. It looks like the scheduler (I use the default CapacityScheduler) does not care about data or rack locality at all!
Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Other local map tasks=1
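For completeness, here is a minimal sketch of the kind of topology script I used, assuming it is registered via net.topology.script.file.name in core-site.xml (the bash example in the Hadoop Manual is more elaborate):

    #!/bin/bash
    # Sketch: put every node in its own rack. Hadoop invokes the script
    # with one or more hostnames/IPs as arguments and expects one rack
    # path per argument on stdout.
    for node in "$@"; do
        echo "/rack-${node}"
    done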