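I am trying to distribute jobs across multiple CUDA devices, where the total number of jobs running at any time should be less than or equal to the number of available CPU cores. To do this, I determine the number of available "slots" on each device and create a list holding the available slots. If I have 6 CPU cores and two CUDA devices (0 and 1), then AVAILABLE_GPU_SLOTS = [0, 1, 0, 1, 0, 1]. In my worker function I pop from the list and save the value to a variable, set the CUDA_VISIBLE_DEVICES env var in the subprocess call, and then append the value back to the list. So far this has worked, but I want to avoid race conditions.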
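For illustration, the slot list described above could be built by cycling through the device ids. This is only a minimal sketch: `build_slots` is a hypothetical stand-in for the `build_available_gpu_slots` helper used in the code, whose implementation is not shown, and it assumes one slot per worker process.

```python
def build_slots(pool_size, devices):
    """Return one device id per worker, assigned round-robin.

    Hypothetical helper: assumes `devices` is a non-empty list of CUDA
    device ids and `pool_size` is the number of worker processes.
    """
    return [devices[i % len(devices)] for i in range(pool_size)]


slots = build_slots(6, [0, 1])
print(slots)
```

With 6 workers and devices 0 and 1, each device ends up with three slots.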
The current code is as follows:
import multiprocessing
import os
import subprocess

def work(cmd):
    # Take a device id, run the job pinned to that device, then give the id back.
    slot = AVAILABLE_GPU_SLOTS.pop()
    exit_code = subprocess.call(cmd, shell=False, env=dict(os.environ, CUDA_VISIBLE_DEVICES=str(slot)))
    AVAILABLE_GPU_SLOTS.append(slot)
    return exit_code

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    mols_to_be_run = [name for name in os.listdir(YANK_FILES) if os.path.isdir(os.path.join(YANK_FILES, name))]
    cmds = build_cmd(mols_to_be_run)
    cuda = get_cuda_devices()
    AVAILABLE_GPU_SLOTS = build_available_gpu_slots(pool_size, cuda)
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    pool.map(work, cmds)
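Could I declare lock = multiprocessing.Lock() at the same level as AVAILABLE_GPU_SLOTS, pass it along with cmds, and then do the following in work()?
with lock:
    slot = AVAILABLE_GPU_SLOTS.pop()
# subprocess stuff
with lock:
    AVAILABLE_GPU_SLOTS.append(slot)
Or do I need a Manager list? Or perhaps there is a better way to solve what I am doing.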
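Answer 0 (score: 0)
Based on what I found in the following SO answer, Python sharing a lock between processes:
Using a regular list results in each process having its own copy, as expected. Using a Manager list seems to be enough to solve that. Example code: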
import multiprocessing
import time

def doing_work(honk):
    proc = multiprocessing.current_process()
    # Optional: guard the pop with the lock installed by init().
    # with lock:
    #     print('%s about to pop SLOTS_LIST %s' % (proc, SLOTS_LIST))
    #     slot = SLOTS_LIST.pop()
    #     print('%s just popped %s from %s' % (proc, slot, SLOTS_LIST))
    print('%s about to pop SLOTS_LIST %s' % (proc, SLOTS_LIST))
    slot = SLOTS_LIST.pop()
    print('%s just popped %s from SLOTS_LIST' % (proc, slot))
    time.sleep(10)

def init(l):
    # Pool initializer: make the lock available as a global in every worker.
    global lock
    lock = l

if __name__ == '__main__':
    man = multiprocessing.Manager()
    SLOTS_LIST = [1, 34, 3465, 456, 4675, 6, 4]
    # Workers see this global only when they are started by fork.
    SLOTS_LIST = man.list(SLOTS_LIST)
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=2, initializer=init, initargs=(l,))
    inputs = range(len(SLOTS_LIST))
    pool.map(doing_work, inputs)
Output:
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675, 6, 4]
<Process(PoolWorker-3, started daemon)> just popped 4 from SLOTS_LIST
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675, 6]
<Process(PoolWorker-2, started daemon)> just popped 6 from SLOTS_LIST
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675]
<Process(PoolWorker-3, started daemon)> just popped 4675 from SLOTS_LIST
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456]
<Process(PoolWorker-2, started daemon)> just popped 456 from SLOTS_LIST
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465]
<Process(PoolWorker-3, started daemon)> just popped 3465 from SLOTS_LIST
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34]
<Process(PoolWorker-2, started daemon)> just popped 34 from SLOTS_LIST
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1]
<Process(PoolWorker-3, started daemon)> just popped 1 from SLOTS_LIST
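This is the desired behavior. I am not sure whether it completely eliminates race conditions, but it seems good enough. Beyond that, adding a lock on top of it is simple.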
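Applied to the original GPU-slot problem, the Manager list and the lock might be combined roughly as follows. This is a sketch under stated assumptions, not a tested production setup: `init_worker`, `run_all`, and the round-robin slot construction are hypothetical names introduced here, the number of slots is assumed to equal the pool size so the list can never be empty when a worker pops, and a `try`/`finally` is added so a slot is returned even if the subprocess call raises.

```python
import multiprocessing
import os
import subprocess
import sys


def init_worker(slots, lock):
    """Pool initializer: store the shared slot list and lock as globals."""
    global SLOTS, LOCK
    SLOTS, LOCK = slots, lock


def work(cmd):
    # Take a device id while holding the lock.
    with LOCK:
        slot = SLOTS.pop()
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(slot))
        return subprocess.call(cmd, shell=False, env=env)
    finally:
        # Always hand the slot back, even if the subprocess call raises.
        with LOCK:
            SLOTS.append(slot)


def run_all(cmds, devices, pool_size):
    """Run every command in cmds; return (exit codes, final slot list)."""
    manager = multiprocessing.Manager()
    # Assumption: one slot per worker, device ids assigned round-robin.
    slots = manager.list([devices[i % len(devices)] for i in range(pool_size)])
    lock = manager.Lock()
    pool = multiprocessing.Pool(pool_size, initializer=init_worker,
                                initargs=(slots, lock))
    try:
        exit_codes = pool.map(work, cmds)
    finally:
        pool.close()
        pool.join()
    return exit_codes, list(slots)


if __name__ == '__main__':
    # Stand-in job: exits with the device id it was given, so the exit
    # codes show which CUDA_VISIBLE_DEVICES value each job saw.
    job = [sys.executable, '-c',
           "import os, sys; sys.exit(int(os.environ['CUDA_VISIBLE_DEVICES']))"]
    codes, remaining = run_all([job] * 4, devices=[0, 1], pool_size=2)
    print(codes, sorted(remaining))
```

Passing the proxies through `initargs` rather than relying on module globals means the workers do not depend on inheriting state from the parent at fork time. Once every job has finished, all slots should be back in the list.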