Trying to run h2o on an HDP 3.1 cluster and running into what appear to be errors related to YARN resource capacity...
Looking in the YARN configs in the Ambari UI, I can't find these properties (the ones named in the error output below). However, when checking the YARN logs in the ResourceManager UI and looking at the logs of some of the killed applications, I see what appear to be unreachable-host errors...
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
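For reference, the 11.0 GB in that error is just the requested mapper heap plus the extra memory percent shown earlier in the output: 10240 MB × 1.10 = 11264 MB. Since each node reports 15.0 GB available, the request should fit, so this hint may not be the actual root cause. A quick way to inspect the two YARN limits the message names (the file path is the usual HDP client-config location; adjust if yours differs):

# inspect the current YARN memory limits on a cluster node
grep -A1 -e 'yarn.nodemanager.resource.memory-mb' \
         -e 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml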
Logs record "java.net.NoRouteToHostException: No route to host (Host unreachable)". For example, the stderr of one of the killed containers:

Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    ....
    at java.net.Socket.<init>(Socket.java:211)
    at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************

However, I can reach all of the nodes from one another, and they can all ping each other, so I'm not sure what is going on here. Any debugging or fixing suggestions?

Answer 0: (score: 1)

I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.

From another community post:

In order for Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:

systemctl disable firewalld
service firewalld stop

So apparently firewalld and iptables need to be disabled across the entire cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services on every node (I recommend using clush, as sketched below), I was able to run the YARN job without issue.
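A minimal sketch with clush, assuming the four hostnames from the logs above and passwordless ssh with sudo on each node:

# stop and disable firewalld everywhere, then verify it is inactive on every node
clush -w hw[01-04].ucera.local 'sudo systemctl stop firewalld && sudo systemctl disable firewalld'
clush -w hw[01-04].ucera.local 'systemctl is-active firewalld'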
Answer 1: (score: 1)
In general, this problem is caused by bad DNS configuration, firewalls, or network unreachability. Quoting this official documentation (a few quick checks for the most common causes are sketched after the list):
- The hostname of the remote machine is wrong in the configuration files.
- The client's host table /etc/hosts has an invalid IP address for the target host.
- The DNS server's host table has an invalid IP address for the target host.
- The client's routing tables (in Linux, iptables) are wrong.
- The DHCP server is publishing bad routing information.
- The client and the server are on different subnets and are not set up to talk to each other. This may be an accident, or it may be a deliberate lockdown of the Hadoop cluster.
- The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6.
- The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for details and solutions). Quick solution: restart the JVM.
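For example (hostnames and the ResourceManager port are taken from the logs above; adjust for your cluster):

getent hosts hw01.ucera.local   # name resolution as seen through /etc/hosts and DNS
ping -c 1 hw03.ucera.local      # basic reachability of a worker node
nc -zv hw01.ucera.local 8050    # is the ResourceManager port actually reachable?
# if IPv6 is the suspect, a common mitigation is to force IPv4 in hadoop-env.sh:
# export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"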
For me, the problem was that the driver was running inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution given in this answer was to set the following configuration:
spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
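As an illustration, a sketch of passing these settings to spark-submit with matching Docker port mappings (the image name, job file, IP, and port numbers are placeholders, not values from the question):

# publish fixed driver ports so workers outside the container can connect back
docker run -p 35000:35000 -p 35001:35001 my-spark-client-image \
  spark-submit \
    --conf spark.driver.host=172.18.4.49 \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.port=35000 \
    --conf spark.driver.blockManager.port=35001 \
    my_job.py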