I want to write a Python script that runs parallel processes across two compute nodes of a Slurm cluster.
I tried pexpect.spawn:
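    import pexpect

    # worker_command is a %-format string defined elsewhere (not shown); it takes the worker ID
    node_list = {'cn000': 28, 'cn001': 28}
    children, pid = [], 0
    for node, ntasks in node_list.items():
        # give each node a contiguous range of worker IDs
        pid_range = range(pid, pid + ntasks)
        pid += ntasks
        for worker in pid_range:
            child = pexpect.spawn(worker_command % worker)
            children.append(child)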
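However, after running the script, the connections that were established (as listed by lsof -i | grep port_number) are all within one of the compute nodes; no connection is established between the two compute nodes.
So I tried pexpect.pxssh to solve the problem. The script I tried is: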
    import pexpect
    from pexpect import pxssh
    from socket import gethostname
    from getpass import getuser

    # node_list and worker_command are the same as in the first snippet
    connections, pid = [], 0
    for node, ntasks in node_list.items():
        pid_range = range(pid, pid + ntasks)
        pid += ntasks
        if node == gethostname():
            # local node: spawn the workers directly
            children = []
            for worker in pid_range:
                child = pexpect.spawn(worker_command % worker)
                children.append(child)
            connections.append(children)
        else:
            # remote node: send every worker command over a single SSH session
            ssh = pxssh.pxssh()
            ssh.login(node, getuser())
            for worker in pid_range:
                ssh.sendline(worker_command % worker)
            connections.append(ssh)
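Unfortunately, although ssh.sendline sent the command lines, the processes did not start, so this does not solve the problem either.
I know that srun can also be used to write scripts that run parallel processes. How can I synchronize the processes launched by srun?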
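To make the srun variant concrete, here is a minimal sketch of how each process launched by srun could work out its own worker ID instead of having a launcher script hand it out. It relies on the SLURM_PROCID and SLURM_NTASKS environment variables that srun sets for every task. The helper names slurm_rank and build_command are hypothetical, and the command template passed in stands in for worker_command above. This is only an illustration under those assumptions, not a tested solution.

```python
import os


def slurm_rank(env=os.environ):
    """Return (rank, total_tasks) for the current srun task.

    SLURM_PROCID and SLURM_NTASKS are set by srun for each task.
    The defaults (0 and 1) are an assumption so the script also runs
    outside Slurm for a local dry run.
    """
    rank = int(env.get("SLURM_PROCID", 0))
    total = int(env.get("SLURM_NTASKS", 1))
    return rank, total


def build_command(template, env=os.environ):
    """Fill a %-style command template with this task's rank.

    `template` plays the role of worker_command in the snippets above.
    """
    rank, _ = slurm_rank(env)
    return template % rank


if __name__ == "__main__":
    rank, total = slurm_rank()
    print("worker %d of %d" % (rank, total))
```

Saved as a hypothetical worker.py, this would be launched with something like srun --nodes=2 --ntasks-per-node=28 python worker.py, so that Slurm places the 56 tasks on the two nodes. Note that this only tells each process its index. It does not provide a barrier or any other synchronization between the processes, which is the part I am still asking about.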