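Based on the question and follow-up answer here: when starting an h2o instance that runs on a Hadoop cluster (e.g. hadoop jar h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir), the callback IP address used to connect to the h2o instance is chosen by the Hadoop runtime. So in most cases the Hadoop runtime picks the IP address and port, looking for the best available ones, and the output looks like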
....
H2O node 172.18.4.63:54321 reports H2O cluster size 4
H2O node 172.18.4.67:54321 reports H2O cluster size 4
H2O cluster (4 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
Open H2O Flow in your web browser: http://172.18.4.67:54321
Connection url output line: Open H2O Flow in your web browser: http://172.18.4.67:54321
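The recommended way to use h2o is to start and stop a separate instance each time you want to use it (sorry, I can't find the supporting documentation at the moment). The problem is that if you want your own Python code to start an h2o instance and connect to it automatically, it will not know which IP to connect to until the instance is already up and running. So the usual way to start an H2O cluster on Hadoop is to let Hadoop decide where the cluster runs, and then parse the output for the line
Open H2O Flow in your web browser: x.x.x.x:54321
to get/extract the IP address.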
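As a minimal sketch of that parsing step (the names FLOW_URL_RE and parse_flow_url are made up for illustration, and the pattern assumes the connection line keeps the host:port form shown in the sample output, with or without the http:// prefix):

```python
import re

# Matches the driver's connection line. The scheme is optional because the line
# is quoted both with and without "http://" in this post.
FLOW_URL_RE = re.compile(
    r"Open H2O Flow in your web browser: (?:https?://)?([^:/\s]+):(\d+)"
)

def parse_flow_url(line):
    """Return (host, port) from the connection line, or None if it does not match."""
    match = FLOW_URL_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_flow_url("Open H2O Flow in your web browser: http://172.18.4.67:54321"))
```

Lines that are not the connection line (for example the "reports H2O cluster size" lines) return None, so the function can be applied to every line of output as it arrives.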
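The problem here is that h2o is a blocking process whose output is printed as a stream of text lines while the instance starts up, rather than all at once, which makes it hard for me to get the final output line I need using basic Python Popen logic. Is there a way to capture the output as it is being produced, so that I can get the line containing the connection IP?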
Answer 0 (score: 0)
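The solution I ended up using was to launch the h2o process in the background, read its output on a separate thread, pass that output back to the main thread through a queue, and use a regular expression to search it for the connection IP. See the example below.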
# start up the hadoop h2o cluster (written for Python 3)
import os
import re
import shlex
import subprocess
import sys
from queue import Queue, Empty  # this module is named Queue on Python 2
from threading import Thread

def enqueue_output(out, queue):
    """
    Function for communicating streaming text lines from a separate thread.
    see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    # the pipe is opened in text mode below, so end-of-stream is the empty string
    for line in iter(out.readline, ''):
        queue.put(line)
    out.close()

# local output dir. left over from earlier h2odriver.jar runs
h2o_output_dir = '/some/location/for/h2odriver.jar/output'

# series of commands to run in order for bringing up the h2o cluster on demand
startup_cmds = [
    # remove any existing tmp log dir. for h2o processes
    'rm -r {}'.format(h2o_output_dir),
    # start h2o on cluster
    '/bin/hadoop jar {}/h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir'.format("/local/h2o/start/path")
]

# clear legacy temp. dir.
if os.path.isdir(h2o_output_dir):
    print(subprocess.check_output(shlex.split(startup_cmds[0]), universal_newlines=True))

# start h2o service as a background child process
startup_p = subprocess.Popen(shlex.split(startup_cmds[1]),
                             shell=False,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                             universal_newlines=True)  # text mode: lines arrive as str

# set up message passing queue; the reader thread feeds it from the child's stdout
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read lines without blocking on the pipe itself
h2o_url_out = ''
while True:
    try:
        line = q.get(timeout=.1)  # short wait instead of spinning on q.get_nowait()
    except Empty:
        continue
    else:  # got line
        print(line.rstrip())
        # check for first instance connection url output
        if re.search("Open H2O Flow in your web browser", line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print('Error generated: %s' % line)
            sys.exit(1)

# capture connection IP from h2o process output
print('Connection url output line: %s' % h2o_url_out)
h2o_cnxn_ip = re.search(r"(?<=Open H2O Flow in your web browser: http://)(.*?)(?=:)", h2o_url_out).group(1)
print('H2O connection ip: %s' % h2o_cnxn_ip)
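For anyone who wants to try the thread-plus-queue pattern without a Hadoop cluster, here is a self-contained sketch of the same idea with a timeout added. It is not the code I run in production: wait_for_line and _pump are names made up for this example, the child process is a stand-in Python one-liner that only mimics the driver's output, and merging stderr into stdout is my own choice here so that a single reader thread drains everything.

```python
import re
import subprocess
import sys
import time
from queue import Queue, Empty
from threading import Thread

def _pump(stream, queue):
    """Forward each line of the stream to the queue, then signal end of stream."""
    for line in iter(stream.readline, ''):
        queue.put(line)
    stream.close()
    queue.put(None)  # sentinel: the child closed its output

def wait_for_line(cmd, pattern, timeout=60.0):
    """Start cmd and return (process, match) for the first output line matching pattern.

    Raises RuntimeError if the output ends or the timeout expires first.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    lines = Queue()
    reader = Thread(target=_pump, args=(proc.stdout, lines))
    reader.daemon = True  # do not keep the program alive for the reader
    reader.start()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            line = lines.get(timeout=0.1)
        except Empty:
            continue
        if line is None:
            raise RuntimeError('process exited before printing a matching line')
        match = re.search(pattern, line)
        if match:
            return proc, match
    proc.kill()
    raise RuntimeError('timed out waiting for a matching line')

# Stand-in child process (hypothetical) that prints driver-like lines, then stays alive
fake_driver = [sys.executable, '-u', '-c',
               "import time\n"
               "print('H2O cluster (4 nodes) is up')\n"
               "print('Open H2O Flow in your web browser: http://172.18.4.67:54321')\n"
               "time.sleep(5)\n"]
proc, match = wait_for_line(fake_driver, r'http://([^:/\s]+):(\d+)', timeout=10)
print(match.group(1), match.group(2))
proc.kill()
proc.wait()
```

The reader thread keeps draining the pipe after the match is found, so the child does not stall on a full pipe while it keeps running. With the real driver, the captured host and port would then be handed to the client, for instance h2o.connect(ip=..., port=...) from the h2o Python package, assuming that package is installed.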