H2O cluster startup frequently times out

Date: 2018-08-14 19:57:41

Tags: h2o

Attempting to start up an h2o cluster on (MapR) Hadoop via Python:

# startup hadoop h2o cluster
import os
import subprocess
import h2o
import shlex
import re
import sys

from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Function for communicating streaming text lines from a separate thread.
    see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)), 
                             shell=False, 
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True # thread dies with the program
t.start()

# read line without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait() # or q.get(timeout=.1)
    except Empty:
        continue
    else: # got line
        print line
        # check for first instance connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip

This frequently raises a timeout error:

Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster

Looking through the documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and doing a word search for just "timeout"), I was unable to find anything that helps with this problem; e.g., extending the timeout via hadoop jar h2odriver.jar -timeout <some time> simply extends the time until the timeout error appears.

I have noticed that this usually happens when another h2o cluster instance is already up and running (which I did not expect to matter, since I assumed YARN could support multiple instances), but it sometimes happens when no other cluster has been initialized.

Beyond the error message h2o throws, does anyone have anything else I can try to solve this problem or to get more debugging information?
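One way to check whether another application (e.g., another h2o cluster) is already holding YARN resources is the yarn CLI; a minimal sketch, assuming the yarn client is on the PATH:

# list YARN applications currently in the RUNNING state
import shlex
import subprocess

print subprocess.check_output(
    shlex.split('yarn application -list -appStates RUNNING'))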


UPDATE

I tried to recreate the problem from the command line and got:

[me@mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 172.18.4.62]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032

----- YARN cluster metrics -----
Number of YARN worker nodes: 6

----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used,  0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used,  0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used,  2.0 / 8.7 GB used, 1 / 2 vcores used

----- Queues -----
Queue name:            root.default
    Queue state:       RUNNING
    Current capacity:  0.00
    Capacity:          0.00
    Maximum capacity:  -1.00
    Application count: 0

Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used

----------------------------------------------------------------------

WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR:   Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'

and noticed the output near the end:

WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB) 
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0) 
ERROR:   Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations

I am confused by the reported 0.0 GB of memory and 0 vcores, since no other applications were running on the cluster; looking at the cluster details shown in the YARN RM web UI:

[Screenshot: YARN RM web UI cluster details, showing per-node memory availability]

(I am using an image because I could not find this information consolidated anywhere in the log files, and I do not know why the memory availability is so uneven despite no other applications running.) I should mention at this point that I do not have much experience modifying or inspecting YARN configurations, so I am having a hard time tracking down the relevant information.

It may be that I am starting the h2o cluster with -mapperXmx=6g, but (as seen in the image) one of the nodes has only 5 GB of memory available, so if that node happens to be selected to contribute to the initialized h2o application, it does not have enough memory to support the requested mapper memory. Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping several times without error seems to support this theory (though I need to check further to determine whether I am interpreting this correctly).
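Incidentally, the numbers in the driver output above are consistent with this theory: each mapper container asks for -mapperXmx plus the reported "Extra memory percent" overhead, and the whole job asks for -nodes of those containers. A quick sanity check in plain Python (all values taken from the logs above):

# sanity-check the memory figures reported by h2odriver above
mapper_xmx_mb = 6 * 1024                       # -mapperXmx 6g
extra_mem_pct = 10                             # "Extra memory percent: 10"
nodes = 4                                      # -nodes 4

container_mb = int(mapper_xmx_mb * (1 + extra_mem_pct / 100.0))
print container_mb                             # 6758 -> matches mapreduce.map.memory.mb
print round(nodes * container_mb / 1024.0, 1)  # 26.4 -> matches the 26.4 GB WARNING

A ~6.6 GB container will not fit on the node showing only 5.0 GB available, which would explain why only 3 of the 4 requested worker containers started.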

1 Answer:

Answer 0 (score: 0):

This is most likely because your Hadoop cluster is busy and there is no room to start new YARN containers.

If you request N nodes, then you either get all N nodes or the startup process times out, as you are seeing. You can optionally use the -timeout command line flag to increase the timeout.
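For example, a minimal variation on the startup call from the question, raising the driver timeout (the 600-second value here is just illustrative):

# same Popen call as in the question, with a longer driver timeout
# (600 seconds is an arbitrary illustrative value)
import shlex
import subprocess

local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(
    shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g '
                '-timeout 600 -output hdfsOutputDir'.format(local_h2o_start_path)),
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)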