Question

I want to launch two different Python scripts (the TensorFlow object detection train.py and eval.py) in parallel on different GPUs, and kill eval.py when train.py finishes.

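I have the following code to launch two subprocesses in parallel (based on "How to terminate a python subprocess launched with shell=True"). However, the subprocesses are launched on the same device (I can guess why; I just don't know how to launch them on different devices).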

start_train = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 train.py ..."

start_eval = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 eval.py ..."

commands = [start_train, start_eval]

procs = [subprocess.Popen(i, shell=True, stdout=subprocess.PIPE, preexec_fn=os.setsid) for i in commands]

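After this point I don't know how to proceed. Do I need something like the following? Should I use p.communicate() to avoid a deadlock? Or is it enough to call wait() or communicate() only for train.py, since that is the only process I need to finish?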

for p in procs:
    p.wait()  # I assume this command won't affect the parallel running

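Then I need to use the following command somehow. I don't need the return value from train.py, only the return code of the subprocess. From the Popen.returncode documentation, it looks like wait() and communicate() need returncode to be set, and I don't know how to set it. I would prefer something like: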

if train is done without any error:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM) 
else:
    write the error to the console, or to a file (but how?)

Or this?

train_return = procs[0].wait()
if train_return == 0:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM)
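
A side note on the returncode question: Popen.returncode is not something the caller sets. It stays None until poll(), wait() or communicate() has observed that the child exited, and then holds the exit code. The sketch below is only an illustration under stated assumptions: the child is a stand-in (not train.py) that writes more output than a pipe buffer typically holds, and run_and_collect is a hypothetical helper name. It uses communicate(), which drains the pipe while waiting and so avoids the deadlock that wait() combined with stdout=PIPE can run into.

```python
import subprocess
import sys

# Hypothetical stand-in for a chatty script: ~200 KB on stdout, then exit code 3.
CHILD_CODE = "import sys; sys.stdout.write('x' * 200000); sys.exit(3)"


def run_and_collect(code):
    """Run a Python snippet in a child process.

    Returns (returncode_before, returncode_after, number_of_stdout_bytes).
    """
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    before = proc.returncode  # None: nothing has checked on the child yet
    # communicate() reads stdout to EOF and then waits for the child,
    # so the child never blocks on a full pipe.
    out, _ = proc.communicate()
    return before, proc.returncode, len(out)


if __name__ == "__main__":
    print(run_and_collect(CHILD_CODE))
```

If the output is not needed at all, redirecting it to a file or to subprocess.DEVNULL makes a plain wait() safe as well.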

Update after solving the problem:

This is my main block:

if __name__ == "__main__":
    exp = 1
    go = True
    while go:


        create_dir(os.path.join(MAIN_PATH,'kitti',str(exp),'train'))
        create_dir(os.path.join(MAIN_PATH,'kitti',str(exp),'eval'))


        copy_tree(os.path.join(MAIN_PATH,"kitti/eval_after_COCO"), os.path.join(MAIN_PATH,"kitti",str(exp),"eval"))
        copy_tree(os.path.join(MAIN_PATH,"kitti/train_after_COCO"), os.path.join(MAIN_PATH,"kitti",str(exp),"train"))

        err_log = open('./kitti/'+str(exp)+'/error_log' + str(exp) + '.txt', 'w')

        train_command = CUDA_COMMAND_PREFIX + "0 python3 " + str(MAIN_PATH) + "legacy/train.py \
                                            --logtostderr --train_dir " + str(MAIN_PATH) + "kitti/" \
                                            + str(exp) + "/train/ --pipeline_config_path " + str(MAIN_PATH) \
                                            + "kitti/faster_rcnn_resnet101_coco.config"
        eval_command = CUDA_COMMAND_PREFIX + "1 python3 " + str(MAIN_PATH) + "legacy/eval.py \
                                            --logtostderr --eval_dir " + str(MAIN_PATH) + "kitti/" \
                                            + str(exp) + "/eval/ --pipeline_config_path " + str(MAIN_PATH) \
                                            + "kitti/faster_rcnn_resnet101_coco.config --checkpoint_dir " + \
                                            str(MAIN_PATH) + "kitti/" + str(exp) + "/train/"

        os.system("python3 dataset_tools/random_sampler_with_replacement.py --random_set_id " + str(exp))
        time.sleep(20)
        update_train_set(exp)



        train_proc = subprocess.Popen(train_command,
                                      stdout=subprocess.PIPE,
                                      stderr=err_log,  # write errors to a file
                                      shell=True)
        time.sleep(20)
        # preexec_fn=os.setsid gives eval its own process group, so the
        # killpg call below cannot signal this launcher script itself
        eval_proc = subprocess.Popen(eval_command,
                                     stdout=subprocess.PIPE,
                                     shell=True,
                                     preexec_fn=os.setsid)
        time.sleep(20)

        if train_proc.wait() == 0:  # successful termination
            os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)
        err_log.close()

        clean_train_set(exp)
        time.sleep(20)
        exp += 1
        if exp == 51:
            go = False

Answer 1

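By default, TensorFlow allocates operations to "/gpu:0" (or "/cpu:0"), even if you have multiple GPUs. One way to solve this is to manually assign each operation to the second GPU in one of your scripts, using the tf.device context manager: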

with tf.device("/gpu:1"):
    # your ops here
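
An alternative that needs no change to the scripts is to restrict which GPU each process can see through the CUDA_VISIBLE_DEVICES environment variable, passed with the env argument of subprocess.Popen instead of being embedded in a shell string. The following is a minimal sketch under assumptions: launch_on_gpu is a hypothetical helper name, and the child is a stand-in that only echoes the variable. In practice the argument list would be the train.py or eval.py command.

```python
import os
import subprocess
import sys


def launch_on_gpu(args, gpu_id, **popen_kwargs):
    """Start `args` with only GPU `gpu_id` visible to CUDA (hypothetical helper)."""
    env = os.environ.copy()
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # number GPUs by PCI bus order
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the only GPU the child can see
    return subprocess.Popen(args, env=env, **popen_kwargs)


# Stand-in child that reports what it sees; replace with the real script command.
ECHO = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

if __name__ == "__main__":
    procs = [launch_on_gpu(ECHO, gpu, stdout=subprocess.PIPE, text=True)
             for gpu in (0, 1)]
    for p in procs:
        out, _ = p.communicate()
        print(out.strip())
```

With this approach each process sees a single GPU, which it addresses as device 0, so no tf.device block is needed in either script.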

Update

If I understand you correctly, you need something like this:

import os
import signal
import subprocess

err_log = open('error_log.txt', 'w')
train_proc = subprocess.Popen(start_train,
                              stdout=subprocess.PIPE,
                              stderr=err_log,  # write errors to a file
                              shell=True)
eval_proc = subprocess.Popen(start_eval,
                             stdout=subprocess.PIPE,
                             shell=True,
                             preexec_fn=os.setsid)  # own process group for killpg

if train_proc.wait() == 0:  # successful termination
    os.killpg(os.getpgid(eval_proc.pid), signal.SIGTERM)
# else, errors will have been written to the 'error_log.txt' file
err_log.close()
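
A note on the os.killpg call: it signals a whole process group, so the evaluator should run in its own session or group (for example with start_new_session=True or preexec_fn=os.setsid). Otherwise the group being signalled is the launcher's own. Below is a minimal, POSIX-only sketch of the pattern with stand-in children (a short sleep in place of train.py, a long one in place of eval.py). run_pair is a hypothetical helper name, and unlike the snippet above it stops the evaluator whether or not training succeeded, which is an assumption about the desired behaviour.

```python
import os
import signal
import subprocess
import sys


def run_pair(train_args, eval_args, err_path):
    """Run both commands in parallel and stop the evaluator when training ends.

    Returns (train_returncode, eval_returncode).
    """
    with open(err_path, "w") as err_log:
        train_proc = subprocess.Popen(train_args, stderr=err_log)
        # New session => new process group, so killpg cannot hit the launcher.
        eval_proc = subprocess.Popen(eval_args, start_new_session=True)

        train_rc = train_proc.wait()  # blocks the launcher only; eval keeps running
        if eval_proc.poll() is None:  # evaluator still alive: stop its whole group
            os.killpg(os.getpgid(eval_proc.pid), signal.SIGTERM)
        eval_rc = eval_proc.wait()    # reap the child so it does not become a zombie
    return train_rc, eval_rc


if __name__ == "__main__":
    train = [sys.executable, "-c", "import time; time.sleep(1)"]
    evaluate = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_pair(train, evaluate, "error_log.txt"))
```

A negative return code from the evaluator means it was terminated by that signal number, which is how a deliberate stop can be told apart from a crash with a positive exit code.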

Launch two scripts in parallel and stop one based on the return of the other
