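This question is related to my previous question, "Worker node-status on a Ray EC2 cluster: update-failed", about using Ray on an EC2 cluster. Although the configuration specifies 2 worker nodes, the cluster appears to use only the head node. Below is the output from tailing the monitor log, which contains a repeating error that I don't understand. (I am posting this as a new question because there is a lot of text and it may turn out to be unrelated to that question.)
New error trace: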
$ ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
2019-05-26 20:33:14,588 INFO updater.py:95 -- NodeUpdater: Waiting for IP of i-04a42aa146ce9b129...
2019-05-26 20:33:14,588 INFO log_timer.py:21 -- NodeUpdater: i-04a42aa146ce9b129: Got IP [LogTimer=414ms]
2019-05-26 20:33:14,594 INFO updater.py:272 -- NodeUpdater: Running tail -n 100 -f /tmp/ray/session_*/logs/monitor* on 100.24.20.34...
==> /tmp/ray/session_2019-05-27_00-31-35_902117_10123/logs/monitor.err <==
2019-05-27 00:31:52,106 INFO autoscaler.py:647 -- LoadMetrics: MostDelayedHeartbeats={'172.31.58.46': 0.33398985862731934}, NodeIdleSeconds=Min=14 Mean=14 Max=14, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/36.0 b'CPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-05-27 00:31:57,062 INFO autoscaler.py:646 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
2019-05-27 00:31:57,063 INFO autoscaler.py:647 -- LoadMetrics: MostDelayedHeartbeats={'172.31.58.46': 0.270449161529541}, NodeIdleSeconds=Min=19 Mean=19 Max=19, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/36.0 b'CPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-05-27 00:31:57,331 INFO updater.py:272 -- NodeUpdater: Running uptime on 172.31.57.23...
2019-05-27 00:32:02,076 INFO updater.py:272 -- NodeUpdater: Running uptime on 172.31.55.204...
2019-05-27 00:32:02,110 INFO autoscaler.py:646 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
2019-05-27 00:32:02,110 INFO autoscaler.py:647 -- LoadMetrics: MostDelayedHeartbeats={'172.31.58.46': 0.2268660068511963}, NodeIdleSeconds=Min=24 Mean=24 Max=24, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/36.0 b'CPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-05-27 00:32:02,544 INFO log_timer.py:21 -- NodeUpdater: i-09402f41cdaf55b70: Got SSH [LogTimer=20562ms]
2019-05-27 00:32:02,547 INFO log_timer.py:21 -- NodeUpdater: i-09402f41cdaf55b70: Initialization commands completed [LogTimer=4ms]
2019-05-27 00:32:02,548 INFO updater.py:272 -- NodeUpdater: Running export RAY_HEAD_IP=172.31.58.46; sudo pkill -9 apt-get || true on 172.31.55.204...
2019-05-27 00:32:02,641 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-09402f41cdaf55b70'] [LogTimer=97ms]
2019-05-27 00:32:02,661 INFO updater.py:272 -- NodeUpdater: Running export RAY_HEAD_IP=172.31.58.46; sudo pkill -9 dpkg || true on 172.31.55.204...
2019-05-27 00:32:02,750 INFO updater.py:272 -- NodeUpdater: Running export RAY_HEAD_IP=172.31.58.46; sudo dpkg --configure -a on 172.31.55.204...
2019-05-27 00:32:02,851 INFO updater.py:272 -- NodeUpdater: Running export RAY_HEAD_IP=172.31.58.46; sudo apt-get update on 172.31.55.204...
2019-05-27 00:32:07,176 INFO autoscaler.py:646 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
2019-05-27 00:32:07,177 INFO autoscaler.py:647 -- LoadMetrics: MostDelayedHeartbeats={'172.31.58.46': 0.2408006191253662}, NodeIdleSeconds=Min=29 Mean=29 Max=29, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/36.0 b'CPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-05-27 00:32:07,358 INFO updater.py:272 -- NodeUpdater: Running uptime on 172.31.57.23...
2019-05-27 00:32:08,403 INFO updater.py:272 -- NodeUpdater: Running export RAY_HEAD_IP=172.31.58.46; sudo apt-get install -y build-essential on 172.31.55.204...
2019-05-27 00:32:08,729 INFO log_timer.py:21 -- NodeUpdater: i-09402f41cdaf55b70: Setup commands completed [LogTimer=6181ms]
2019-05-27 00:32:08,729 INFO log_timer.py:21 -- NodeUpdater: i-09402f41cdaf55b70: Applied config c4e33aa96ec128145b1a482dde318746d3aa8234 [LogTimer=26767ms]
2019-05-27 00:32:08,730 ERROR updater.py:145 -- NodeUpdater: i-09402f41cdaf55b70: Error updating (Exit Status 100) ssh -i ~/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ubuntu_ray_ssh_sockets/18_c48large/%C -o ControlPersist=10s ubuntu@172.31.55.204 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && export RAY_HEAD_IP=172.31.58.46; sudo apt-get install -y build-essential'
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 148, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 137, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 236, in do_update
    self.ssh_cmd(cmd, redirect=redirect)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 295, in ssh_cmd
    stderr=redirect or sys.stderr)
  File "/home/ubuntu/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '~/ray_bootstrap_key.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ubuntu_ray_ssh_sockets/18_c48large/%C', '-o', 'ControlPersist=10s', 'ubuntu@172.31.55.204', "bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && export RAY_HEAD_IP=172.31.58.46; sudo apt-get install -y build-essential'"]' returned non-zero exit status 100.
The same exception then occurs in Thread-7, followed by:
==> /tmp/ray/session_2019-05-27_00-31-35_902117_10123/logs/monitor.err <==
2019-05-27 00:33:17,843 INFO autoscaler.py:646 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update) (bringup=True)
2019-05-27 00:33:17,844 INFO autoscaler.py:647 -- LoadMetrics: MostDelayedHeartbeats={'172.31.55.204': 65.62029552459717, '172.31.57.23': 45.396358251571655, '172.31.58.46': 0.21964216232299805}, NodeIdleSeconds=Min=100 Mean=100 Max=100, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/36.0 b'CPU', TimeSinceLastHeartbeat=Min=0 Mean=37 Max=65
...repeated indefinitely.
Answer 0 (score: 0)
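I found exit status 100 described as "container released on a lost node", but that description comes from Hadoop YARN rather than from Ray. In this log the status is returned by the failing `sudo apt-get install -y build-essential` setup command, and apt-get exits with 100 on error. Either way, the problem is that the updater failed on both workers. That is the same issue as my original question, "Worker node-status on a Ray EC2 cluster: update-failed", so I am closing this one.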
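To see which setup command the updater was running when it failed, the monitor log can be scanned for the `Error updating (Exit Status N)` lines. The following is a minimal sketch, not part of Ray: the function name `find_update_failures` is hypothetical, and the regular expression assumes the exact log format shown in the question, which may differ in other Ray versions.

```python
import re

# Matches updater error lines in the format seen in the monitor log above:
#   ... ERROR updater.py:145 -- NodeUpdater: <instance-id>: Error updating (Exit Status <n>) <ssh command>
# Assumption: this format is taken from the pasted log only.
ERROR_RE = re.compile(
    r"ERROR updater\.py:\d+ -- NodeUpdater: (?P<node>i-[0-9a-f]+): "
    r"Error updating \(Exit Status (?P<status>\d+)\) (?P<command>.*)$"
)


def find_update_failures(log_lines):
    """Return (instance_id, exit_status, remote_command) for each failed update."""
    failures = []
    for line in log_lines:
        match = ERROR_RE.search(line.rstrip("\n"))
        if match is None:
            continue
        # The setup command is the last ';'-separated part of the quoted bash string.
        remote = match.group("command").rsplit(";", 1)[-1].strip().rstrip("'")
        failures.append((match.group("node"), int(match.group("status")), remote))
    return failures


if __name__ == "__main__":
    # Shortened copy of the error line from the question, used as sample input.
    sample = (
        "2019-05-27 00:32:08,730 ERROR updater.py:145 -- NodeUpdater: "
        "i-09402f41cdaf55b70: Error updating (Exit Status 100) ssh -i "
        "~/ray_bootstrap_key.pem ubuntu@172.31.55.204 bash --login -c -i "
        "'true && export RAY_HEAD_IP=172.31.58.46; "
        "sudo apt-get install -y build-essential'"
    )
    for node, status, command in find_update_failures([sample]):
        print(node, status, command)
```

Running the extracted command by hand on the worker (over SSH) should then show apt-get's own error message, which the monitor log does not include.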