Hadoop job does not achieve data locality

Date: 2017-01-24 14:56:57

Tags: hadoop hadoop-streaming

According to the Hadoop manual, map tasks should be started on the nodes where the input data is stored in HDFS, provided slots are available there.

Unfortunately, I have found that this is not the case when using the Hadoop Streaming library: tasks are started on nodes entirely different from the ones physically holding the input file (given by the -input flag), even though no other work is being executed on the system.
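For reference, a minimal sketch of how such a streaming job is submitted; the jar location, paths, mapper, and reducer below are placeholders rather than the actual ones I used:

    # Submit a streaming job; the HDFS input path is the file whose block
    # locations should, in theory, attract the map tasks.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/test/input/largefile.txt \
        -output /user/test/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc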

I have tested this on several systems with Hadoop 2.6.0 and 2.7.2. Is there any way to influence this behavior? The input files are large, and this unnecessary network traffic noticeably degrades overall performance.

*Update*

Following a suggestion from the comments, I tested the issue with a "plain" Hadoop job, namely the classic WordCount example. The result is the same: in only 1 out of 10 runs was the map task executed on a node where the data was already present. I verified the block locations and the chosen execution node both through the web interfaces (NameNode and YARN web UIs) and with command-line tools (hdfs fsck ... and reading YARN's log files). I had just restarted the cluster and verified that no other interfering jobs were running.
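For example, the block locations can be listed like this (the path is a placeholder for the actual input file):

    # Show which DataNodes hold the blocks of the input file
    hdfs fsck /user/test/input/largefile.txt -files -blocks -locations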

I also noticed that the Data-local map tasks counter does not even appear in the job's summary output; I only get Rack-local map tasks. Of course it is rack-local, since there is only one rack in this test environment. Is there a configuration option I am missing?

File System Counters
        FILE: Number of bytes read=3058
        FILE: Number of bytes written=217811
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=3007
        HDFS: Number of bytes written=2136
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=12652
        Total time spent by all reduces in occupied slots (ms)=13368
        Total time spent by all map tasks (ms)=3163
        Total time spent by all reduce tasks (ms)=13368
        Total vcore-seconds taken by all map tasks=3163
        Total vcore-seconds taken by all reduce tasks=13368
        Total megabyte-seconds taken by all map tasks=12955648
        Total megabyte-seconds taken by all reduce tasks=13688832
Map-Reduce Framework
        Map input records=9
        Map output records=426
        Map output bytes=4602
        Map output materialized bytes=3058
        Input split bytes=105
        Combine input records=426
        Combine output records=229
        Reduce input groups=229
        Reduce shuffle bytes=3058
        Reduce input records=229
        Reduce output records=229
        Spilled Records=458
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=75
        CPU time spent (ms)=1480
        Physical memory (bytes) snapshot=440565760
        Virtual memory (bytes) snapshot=4798459904
        Total committed heap usage (bytes)=310902784
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
File Input Format Counters
        Bytes Read=2902
File Output Format Counters
        Bytes Written=2136

*Update 2*

I realized that the only reason the Data-local map tasks counter is missing from the summary is that it equals 0, so it is omitted. I reconfigured the cluster via the net.topology.script.file.name parameter so that every node forms its own rack (see the bash example in the Hadoop Manual).
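A minimal sketch of such a topology script, assuming the setup described above (every node mapped to its own rack); it is only an illustration in the spirit of the Hadoop Manual's bash example, not the exact script I used:

    #!/bin/bash
    # Referenced from core-site.xml via net.topology.script.file.name.
    # Hadoop passes one or more node addresses as arguments and expects
    # one rack path per argument on stdout; here every node gets its own rack.
    for node in "$@"; do
        echo "/rack-${node//./-}"
    done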

Now Hadoop proudly reports that the executed task is not even rack-local. It looks like the scheduler (I use the default CapacityScheduler) does not care about data or rack locality at all!

Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1

0 Answers:
