Working with the HTCondor scheduler

Time: 2018-11-26 20:31:49

Tags: python parallel-processing dask condor

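Background

I have an image analysis pipeline with parallelized steps. The pipeline is written in Python, and the parallelization is controlled by dask.distributed. The minimum processing setup is 1 scheduler + 3 workers, with 15 processes per worker. In the first step of the analysis I use 1 process per worker but all of the node's RAM; in all the other analysis steps, all nodes and processes are used.

Problem

The admins are going to install HTCondor as the scheduler for the cluster.

Thoughts

To get my code running on the new setup, I plan to use the approach shown in the dask manual for SGE, because the cluster has a shared network file system.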

# job1 
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Job2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json

# Job3 
# Start a process with the python code where the client is started this way
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
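
For illustration, the two qsub commands above might map onto HTCondor submit descriptions roughly as follows. This is a minimal sketch, not a tested configuration: the helper `render_submit` is hypothetical, the paths are the same placeholders used above, and `should_transfer_files = NO` assumes the shared file system mentioned earlier.

```python
def render_submit(executable, arguments, count=1):
    """Return the text of a minimal HTCondor submit description.

    Hypothetical helper for illustration only; a real submit file would
    usually also set log, output, error, and resource requests.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # Assumption: the shared network file system makes file transfer unnecessary.
        "should_transfer_files = NO",
        # "queue N" submits N copies of the job, similar to an SGE array job.
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


scheduler_file = "/path/to/scheduler.json"

# Job 1: a single dask-scheduler that writes its connection info to the file.
scheduler_submit = render_submit(
    "/path/to/dask-scheduler", f"--scheduler-file {scheduler_file}"
)

# Job 2: 100 dask-worker jobs that read the same file.
worker_submit = render_submit(
    "/path/to/dask-worker", f"--scheduler-file {scheduler_file}", count=100
)

print(worker_submit)
```

Each rendered string could be written to its own file and passed to condor_submit.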

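Questions and suggestions

If I understand this approach correctly, I would start the scheduler, the workers, and the analysis as independent jobs (separate HTCondor submit files). How can I make sure the order of execution is correct? Is there a way to keep the same processing approach I have been using, or would it be more efficient to translate the code so that it works better with HTCondor? Thanks for the help!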
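
One way to make the analysis job tolerant of start-up order, sketched here under assumptions, is to have it poll the shared scheduler file before connecting. The helper name `wait_for_scheduler_file` and its timeout values are hypothetical; the sketch assumes the file is JSON containing an "address" entry, which is what dask-scheduler writes when started with --scheduler-file.

```python
import json
import os
import time


def wait_for_scheduler_file(path, timeout=300.0, poll=1.0):
    """Block until `path` holds valid JSON with an "address" key.

    Returns the scheduler address, or raises TimeoutError if the file is
    still missing or incomplete once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as fh:
                info = json.load(fh)
            if isinstance(info, dict) and "address" in info:
                return info["address"]
        except (FileNotFoundError, json.JSONDecodeError):
            # File not written yet, or only partially written: try again.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no usable scheduler file at {path}")
        time.sleep(poll)
```

The analysis script could call this first and only then create `Client(scheduler_file=...)`, so it does not matter whether HTCondor starts the scheduler job before or after the analysis job.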

1 answer:

Answer 0: (score: 0)

HTCondor JobQueue support has been merged (https://github.com/dask/dask-jobqueue/pull/245) and should now be available in Dask JobQueue (HTCondorCluster(cores=1, memory='100MB', disk='100MB')).
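
A minimal sketch of how this could look in practice, assuming a dask-jobqueue release that includes HTCondorCluster and a machine that can submit to the HTCondor pool. The helper names `htcondor_cluster_kwargs` and `start_cluster` are hypothetical, and the resource values are the ones quoted above rather than a recommendation for the image pipeline.

```python
def htcondor_cluster_kwargs(cores=1, memory="100MB", disk="100MB"):
    """Collect the per-job resources passed to HTCondorCluster."""
    return {"cores": cores, "memory": memory, "disk": disk}


def start_cluster(n_jobs=3):
    """Start an HTCondor-backed Dask cluster and return (cluster, client).

    Imports are local so this module can be loaded without dask-jobqueue.
    """
    from dask.distributed import Client
    from dask_jobqueue import HTCondorCluster

    cluster = HTCondorCluster(**htcondor_cluster_kwargs())
    # Ask HTCondor for n_jobs worker jobs; they join as they start running.
    cluster.scale(jobs=n_jobs)
    return cluster, Client(cluster)


# Usage on a submit node (not executed here):
#   cluster, client = start_cluster(n_jobs=3)
#   future = client.submit(sum, [1, 2, 3])
#   print(future.result())
```

With this approach the scheduler runs inside the Python process that creates the cluster and the workers are submitted by dask-jobqueue, so separate submit files for the scheduler and workers are not needed.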