I set up an IPython parallel ipcluster to use Sun Grid Engine, and it appears to work fine:
ipcluster start -n 100 --profile=sge
2016-07-15 14:47:09.749 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-07-15 14:47:09.751 [IPClusterStart] Creating pid file: /home/USERNAME/.ipython/profile_sge/pid/ipcluster.pid
2016-07-15 14:47:09.751 [IPClusterStart] Starting Controller with SGEControllerLauncher
2016-07-15 14:47:09.789 [IPClusterStart] Job submitted with job id: u'6354583'
2016-07-15 14:47:10.790 [IPClusterStart] Starting 100 Engines with SGEEngineSetLauncher
2016-07-15 14:47:10.826 [IPClusterStart] Job submitted with job id: u'6354584'
2016-07-15 14:47:40.856 [IPClusterStart] Engines appear to have started successfully
I then connect from a notebook using
rc = ipp.Client(profile='sge')
But when I use the parallel magic
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()
print('I am #{} of {} and run on {}'.format(rank,nprocs,MPI.Get_processor_name()))
every process reports rank 0 in a world of size 1:
[stdout:0] I am #0 of 1 and run on compute-8-13.local
[stdout:1] I am #0 of 1 and run on compute-8-13.local
[stdout:2] I am #0 of 1 and run on compute-3-3.local
[stdout:3] I am #0 of 1 and run on compute-3-3.local
[stdout:4] I am #0 of 1 and run on compute-3-3.local
...
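Output like this can be classified mechanically. The following is a minimal sketch, not part of the original setup: the helper name `describe_mpi_layout` and its return strings are hypothetical, and it only inspects (rank, size) pairs that have already been collected from the engines.

```python
def describe_mpi_layout(reports):
    """Classify (rank, size) pairs, one pair per engine.

    Hypothetical helper for illustration only.
    """
    n_engines = len(reports)
    sizes = {size for _, size in reports}
    ranks = sorted(rank for rank, _ in reports)
    # One shared MPI world: every engine sees the full size and ranks are 0..n-1.
    if sizes == {n_engines} and ranks == list(range(n_engines)):
        return "shared world"
    # Every engine is alone in its own world, as in the output above.
    if sizes == {1}:
        return "isolated engines"
    return "mixed layout"


# The pattern from the output above: each engine is rank 0 of 1.
print(describe_mpi_layout([(0, 1)] * 5))
```

With a live cluster, the pairs could presumably be gathered by running a small function on every engine (for example through `rc[:].apply_sync`) that returns `comm.Get_rank()` and `comm.Get_size()`.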
Here are my setup scripts:
ipcluster_config.py:
c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
c.SlurmEngineSetLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.engine.template'
c.SlurmControllerLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.controller.template'
ipcontroller_config.py:
c.HubFactory.ip = '*'
sge.controller.template:
# /bin/sh
#$ -S /bin/sh
#$ -pe orte 1
#$ -q sThC.q
#$ -cwd
#$ -N ipyparallel_controller
#$ -o ipyparallel_controller.log
#$ -e ipyparallel_controller.err
module load gcc/5.3/openmpi
source activate parallel
ipcontroller --profile-dir={profile_dir}
sge.engine.template:
# /bin/sh
#$ -S /bin/sh
#$ -pe orte {n}
#$ -q sThC.q
#$ -cwd
#$ -N ipyparallel_engines
#$ -o ipyparallel_engines.log
#$ -e ipyparallel_engines.err
module load gcc/5.3/openmpi
source activate parallel
mpiexec -n {n} ipengine --profile-dir={profile_dir} --timeout=30
Answer 0 (score: 0)
Found the solution (and the mistake) myself:
In ipcluster_config.py I forgot to rename some occurrences of Slurm -> SGE, so it should be
c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
c.SGEEngineSetLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.engine.template'
c.SGEControllerLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.controller.template'
Because of this, ipcluster used some default SGE template and submitted 100 separate jobs instead of one job with 100 processes.
Now I get what I wanted:
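A mismatch like this is easy to miss because, as far as I can tell, settings for a launcher class that is never used are simply ignored without any error. The following is a minimal sketch of a check for that case; the function name `find_unused_launcher_settings` is hypothetical, and it only does a line-by-line text scan of simple `c.Class.trait = value` assignments rather than loading the config the way IPython does.

```python
import re

# Matches lines such as: c.SGEEngineSetLauncher.batch_template_file = '...'
SETTING = re.compile(r"^\s*c\.(\w+)\.(\w+)\s*=\s*(.+?)\s*$")


def find_unused_launcher_settings(config_text):
    """Return (class, trait) pairs set on a launcher class that is not selected.

    Hypothetical helper: a launcher counts as selected when its name is the
    value of some *_launcher_class setting in the same text.
    """
    settings = []
    selected = set()
    for line in config_text.splitlines():
        match = SETTING.match(line)
        if not match:
            continue
        cls, trait, value = match.groups()
        if trait.endswith("_launcher_class"):
            # Keep only the bare class name, e.g. SGEEngineSetLauncher.
            selected.add(value.strip("'\"").split(".")[-1])
        else:
            settings.append((cls, trait))
    return [(cls, trait) for cls, trait in settings
            if cls.endswith("Launcher") and cls not in selected]


if __name__ == "__main__":
    import sys

    # Usage: python check_launchers.py path/to/ipcluster_config.py
    with open(sys.argv[1]) as handle:
        for cls, trait in find_unused_launcher_settings(handle.read()):
            print("setting on unselected launcher: c.{}.{}".format(cls, trait))
```

Run against the original ipcluster_config.py, this would flag the two `batch_template_file` lines that still name the Slurm launchers.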
[stdout:0] I am #5 of 100 and run on compute-5-17.local
[stdout:1] I am #9 of 100 and run on compute-5-17.local
[stdout:2] I am #1 of 100 and run on compute-5-17.local
[stdout:3] I am #7 of 100 and run on compute-5-17.local
[stdout:4] I am #2 of 100 and run on compute-5-17.local
...