Question

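I have successfully configured MPI, with mpi4py support, across three nodes, as shown by a test run of the helloworld.py script from the mpi4py demo directory: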

gms@host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py

Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.

I am now trying to use this from ipython, and I have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file as follows:

c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]

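I then start the iPython notebook on my head node with the command ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &

And, voilà, I can access it from another device on my local network. However, when I run helloworld.py from the iPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.

I started mpi from iPython with 10 engines, but...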
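
One possible reason for seeing only the head node is that code typed into a notebook cell runs in the notebook's own kernel, which is a single process and not one of the MPI engines. The following is a minimal sketch of asking the engines themselves where they run. It assumes the cluster for the mpi profile is already running and that the installed IPython still ships IPython.parallel (later releases moved it to the separate ipyparallel package). The names engines_per_host and query_engine_hosts are hypothetical helpers introduced here for illustration only.

```python
import socket
from collections import Counter


def engines_per_host(hostnames):
    """Count how many engines reported each host name."""
    return dict(Counter(hostnames))


def query_engine_hosts(profile="mpi"):
    """Ask every engine of a running cluster for its host name."""
    # Imported lazily so the helper above can be used without a cluster.
    # Assumption: IPython < 4; newer versions use `from ipyparallel import Client`.
    from IPython.parallel import Client

    rc = Client(profile=profile)  # connects using the profile's client connection file
    view = rc[:]                  # DirectView over all registered engines
    return view.apply_sync(socket.gethostname)


# In a notebook cell, with the cluster running:
#     engines_per_host(query_engine_hosts())
```

If the resulting dictionary lists only the head node, the engines on the workers never registered with the controller, which points at the cluster setup and not at mpi4py.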

I further configured these parameters, just in case.

In $IPYTHON_DIR/profile_mpi/ipcluster_config.py:

c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'

In $IPYTHON_DIR/profile_mpi/ipengine_config.py:

c.MPI.use = 'mpi4py'

In $IPYTHON_DIR/profile_mpi/ipcontroller_config.py:

c.HubFactory.ip = '*'

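However, these did not help either.

What am I missing to get this working correctly?

Update 1

I have now mounted NFS directories on my worker nodes, and so I meet the requirement that "currently ipcluster requires that the IPYTHONDIR/profile_<name>/security directory live on a shared filesystem that is seen by both the controller and engines" in order to use the command ipcluster start --profile=mpi -n 6 & to start my controller and engines.

So, I issue this on my head node and get: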

2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully

Then I go on to issue the same command to start the engines on the other nodes, but I get:

2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+ Stopped ipcluster start --profile=mpi -n 6

There is no confirmation that the Engines appear to have started successfully ...

Even more confusing, when I run ps au on the worker nodes, I get:

gms       3862  0.1  2.5  38684 23740 pts/0    T    20:31   0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms       3874  0.1  1.7  21428 16772 pts/0    T    20:31   0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms       3875  0.0  0.2   4768  2288 pts/0    T    20:31   0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms       3876  0.0  0.4   5732  4132 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms       3877  0.0  0.1   4816  1204 pts/0    T    20:31   0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms       3878  0.0  0.4   5732  4028 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms       3879  0.0  0.6   8944  6008 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms       3880  0.0  0.6   8944  6108 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config

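The IP addresses in processes 3876 and 3878 belong to the other hosts in the cluster. But...

When I run a similar test directly with ipython, all I get is a response from localhost (even though, minus ipython, this works directly with mpi and mpi4py, as noted in my original post):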

gms@head:~/development/mpi$ ipython test.py
head[3834]: 0/1

gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10

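I still seem to be missing something obvious, even though I am convinced that my configuration is now correct. One thing that stands out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid

Update 2

This is more to document what is happening, in the hope of eventually getting this working:

I cleaned out my log files and re-issued ipcluster start --profile=mpi -n 6 &

I now see 6 log files for my engines and 1 for my controller: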

drwxr-xr-x 2 gms gms 12288 Mar  6 03:28 .
drwxr-xr-x 7 gms gms  4096 Mar  6 03:31 ..
-rw-r--r-- 1 gms gms  1313 Mar  6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4629.log

Looking at the ipcontroller log, it appears that only one engine registered:

2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0

Shouldn't each of the 6 engines be registered?
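
To make the comparison between the engines requested and the engines registered less error-prone, the controller log can be scanned mechanically. This is a rough sketch that relies only on the two message formats visible in the log above; the function name summarize_registrations is a hypothetical helper, and other IPython versions may word these messages differently.

```python
import re

# Patterns taken from the controller log lines shown above.
REGISTERED = re.compile(r"registration::finished registering engine (\d+):")
PURGED = re.compile(r"registration::purging stalled registration: (\d+)")


def summarize_registrations(log_text):
    """Return (registered_ids, purged_ids) found in a controller log."""
    registered = sorted(int(m) for m in REGISTERED.findall(log_text))
    purged = sorted(int(m) for m in PURGED.findall(log_text))
    return registered, purged


sample = """\
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0
"""

print(summarize_registrations(sample))  # ([1], [0])
```

For the log above this reports one completed registration (engine 1) and one purged, stalled registration (engine 0); the remaining four engines never show up in the controller log at all.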

Two of the engine logs look as if those engines registered fine:

2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir: u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1

with the other one registering with id 0.

But the other 4 engines reported a timeout error:

2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
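
One detail in these engine logs that may be worth checking: every engine tries to register at tcp://127.0.0.1:34540. A loopback address always refers to the machine the process itself runs on, so an engine started on a worker node would be looking for the controller on that worker. Whether this is the actual cause of the timeouts here is an assumption, not something the logs confirm. The sketch below only classifies a registration URL copied from a log line; is_loopback_url is a hypothetical helper name, and 192.168.1.200 is used purely as an example of a non-loopback address.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url):
    """Return True if the URL's host can only be reached from the same machine."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a literal IP address (for example a wildcard or a DNS name).
        return False


print(is_loopback_url("tcp://127.0.0.1:34540"))      # True
print(is_loopback_url("tcp://192.168.1.200:34540"))  # False
```

If the URL the engines load from ipcontroller-engine.json is a loopback address, engines on other hosts cannot be expected to reach the controller through it.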

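Hmm... I think I may try reinstalling ipython tomorrow.

Update 3

Conflicting versions of ipython were installed (apparently via both apt-get and pip). I uninstalled them and reinstalled using pip install ipython[all] ...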
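
A quick way to see which installation the interpreter actually picks up is to ask Python where it would load a package from. This is a small sketch using only the standard library; module_origin is a hypothetical helper name. On Debian-style systems a package installed through apt typically lives under /usr/lib/.../dist-packages, while a system-wide pip install typically lands under /usr/local/lib/..., but the exact paths depend on the distribution.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Run this with the same interpreter on every node and compare the paths.
for name in ("IPython", "mpi4py"):
    print(name, "->", module_origin(name))
```

Different paths on different nodes, or a path that does not match the installation that was just upgraded, would suggest that two copies are competing.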

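Update 4

I hope someone finds this useful, and I hope someone can weigh in at some point to help clarify a few things.

Anywho, I installed a virtualenv to isolate my environment, and I think it brought some degree of success. I ran 'ipcluster start -n 4 --profile=mpi' on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. Judging by its output, it is doing some parallel computing.

However, when I run my test script that queries all the nodes, I only get the head node.

But, again, if I just run the mpiexec command directly, everything works fine.

To add to the confusion, if I look at the processes on the nodes, I see all sorts of activity indicating that they are working together.

And there is nothing out of the ordinary in my logs. Why am I not getting the other nodes back in my second test script (code included here):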

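# test_mpi.py
# Print the host name, process ID, and MPI rank/size of the current process.
import os
import socket
from mpi4py import MPI

# Use a separate name for the communicator so the MPI module is not shadowed.
comm = MPI.COMM_WORLD

print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),
    pid=os.getpid(),
    rank=comm.Get_rank(),
    size=comm.Get_size(),
))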

Answer 1

Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it, and this time it worked, for some strange reason. I could swear I had this in there before, but alas...
