Hadoop job does not achieve data locality

Date: 2017-01-24 14:56:57

Tags: hadoop hadoop-streaming

According to the Hadoop manual, map tasks should be started on the nodes where the input data is stored in HDFS, provided slots are available there.

Unfortunately, I have found that this is not the case when using the Hadoop Streaming library: tasks are started on nodes entirely different from the ones physically holding the input file (given by the -input flag), even though no other work is being executed on the system.
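For reference, a minimal sketch of how such a streaming job is submitted; the jar location, paths, mapper, and reducer below are placeholders rather than the actual ones I used:

    # Submit a streaming job; the HDFS input path is the file whose block
    # locations should, in theory, attract the map tasks.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/test/input/largefile.txt \
        -output /user/test/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc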

I have tested this on several systems with Hadoop 2.6.0 and 2.7.2. Is there any way to influence this behavior? The input files are large, and this unnecessary network traffic noticeably degrades overall performance.

*Update*

Following a suggestion from the comments, I tested the issue with a "plain" Hadoop job, namely the classic WordCount example. The result is the same: in only 1 out of 10 runs was the map task executed on a node where the data was already present. I verified the block locations and the chosen execution node both through the web interfaces (NameNode and YARN web UIs) and with command-line tools (hdfs fsck ... and reading YARN's log files). I had just restarted the cluster and verified that no other interfering jobs were running.
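For example, the block locations can be listed like this (the path is a placeholder for the actual input file):

    # Show which DataNodes hold the blocks of the input file
    hdfs fsck /user/test/input/largefile.txt -files -blocks -locations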

I also noticed that the Data-local map tasks counter does not even appear in the job's summary output; I only get Rack-local map tasks. Of course it is rack-local, since there is only one rack in this test environment. Is there a configuration option I am missing?

File System Counters
        FILE: Number of bytes read=3058
        FILE: Number of bytes written=217811
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=3007
        HDFS: Number of bytes written=2136
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=12652
        Total time spent by all reduces in occupied slots (ms)=13368
        Total time spent by all map tasks (ms)=3163
        Total time spent by all reduce tasks (ms)=13368
        Total vcore-seconds taken by all map tasks=3163
        Total vcore-seconds taken by all reduce tasks=13368
        Total megabyte-seconds taken by all map tasks=12955648
        Total megabyte-seconds taken by all reduce tasks=13688832
Map-Reduce Framework
        Map input records=9
        Map output records=426
        Map output bytes=4602
        Map output materialized bytes=3058
        Input split bytes=105
        Combine input records=426
        Combine output records=229
        Reduce input groups=229
        Reduce shuffle bytes=3058
        Reduce input records=229
        Reduce output records=229
        Spilled Records=458
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=75
        CPU time spent (ms)=1480
        Physical memory (bytes) snapshot=440565760
        Virtual memory (bytes) snapshot=4798459904
        Total committed heap usage (bytes)=310902784
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
File Input Format Counters
        Bytes Read=2902
File Output Format Counters
        Bytes Written=2136

*Update 2*

I realized that the only reason the Data-local map tasks counter is missing from the summary is that it equals 0, so it is omitted. I reconfigured the cluster via the net.topology.script.file.name parameter so that every node forms its own rack (see the bash example in the Hadoop Manual).
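A minimal sketch of such a topology script, assuming the setup described above (every node mapped to its own rack); it is only an illustration in the spirit of the Hadoop Manual's bash example, not the exact script I used:

    #!/bin/bash
    # Referenced from core-site.xml via net.topology.script.file.name.
    # Hadoop passes one or more node addresses as arguments and expects
    # one rack path per argument on stdout; here every node gets its own rack.
    for node in "$@"; do
        echo "/rack-${node//./-}"
    done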

Now Hadoop proudly reports that the executed task is not even rack-local. It looks like the scheduler (I use the default CapacityScheduler) does not care about data or rack locality at all!

Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1

0 Answers:
