I'm currently developing a neural network with Keras, using TensorFlow as the backend. At this stage I'm still searching for the best model and am running k-fold cross-validation. I've rented a VM with GPUs on Google Cloud (Tesla P100s). My problem: I have already parallelized the training of one model's folds within a single GPU, and now I also want to parallelize the training of different models across multiple GPUs. With a single GPU I can launch up to 4 fold-training processes on it, but as soon as I add a second GPU with the same code, each GPU works through its folds sequentially instead. The code I'm using is the following:

import os
from joblib import Parallel, delayed

def gen_model(n_folds_to_do_in_parallel, ...):
    # Import TensorFlow inside the worker process, after
    # CUDA_VISIBLE_DEVICES has been set by the wrapper below
    import tensorflow as tf
    # Divide the single GPU's memory among the folds trained in parallel,
    # e.g. 4 parallel folds -> each session gets 0.85 / 4 ~ 21% of the GPU
    gpu_mem_fraction = 0.85 / n_folds_to_do_in_parallel
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_mem_fraction)
    tf.keras.backend.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))
    # Create and compile the model with Keras
    ...
    return model

def train_model_fold(args):
    n_folds_to_do_in_parallel, ... = args
    model = gen_model(n_folds_to_do_in_parallel, ...)
    model.fit(...)
    ...
    return results

def do_cross_validation(args):
    n_folds, n_folds_to_do_in_parallel, ... = args
    ...
    # One argument tuple per fold of this model
    args_train = [(n_folds_to_do_in_parallel, ...) for fold in range(n_folds)]
    # Inner level of parallelism: the folds sharing one GPU
    res = Parallel(n_jobs=n_folds_to_do_in_parallel, verbose=1,
                   backend='multiprocessing')(map(delayed(train_model_fold), args_train))
    ...
    return res

def do_cross_validation_wrapper(args):
    # Pin this worker to one GPU, then run its models in sequence
    gpu_id, model_args_list, ... = args
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    tmp_res = []
    for model_args in model_args_list:
        single_res = do_cross_validation(model_args)
        tmp_res.append(single_res)
    return tmp_res

def select_model(models_list, n_gpus, ...):
    # Create the argument tuple for each candidate model
    model_args_list = [(...) for model in models_list]
    # Split the models round-robin across the GPUs
    wrap_args = [(i, [model_args_list[j] for j in range(len(model_args_list))
                      if (j % n_gpus) == i]) for i in range(n_gpus)]
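    # e.g. with 5 models and 2 GPUs this assigns
    #   GPU 0 -> models 0, 2, 4 and GPU 1 -> models 1, 3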
    # Outer level of parallelism: one worker per GPU
    all_res = Parallel(n_jobs=n_gpus, verbose=1,
                       backend='multiprocessing')(map(delayed(do_cross_validation_wrapper), wrap_args))
    ...
    # Select the best model from the results
    ...
    return

def main():
    select_model(...)
    return 0

if __name__ == '__main__':
    main()
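
To make the problem easier to reproduce, here is a stripped-down toy version of the same two-level structure with all the TensorFlow parts removed (the names toy_fold and toy_gpu_worker, the counts, and the sleep standing in for model.fit are made up for illustration). If the nesting itself is what serializes the folds, this sketch should show it even without GPUs:

import time
from joblib import Parallel, delayed

def toy_fold(fold_id):
    time.sleep(2)  # stand-in for model.fit(...)
    return fold_id

def toy_gpu_worker(gpu_id):
    # Inner level: the folds that should share one GPU
    return Parallel(n_jobs=4, backend='multiprocessing')(
        map(delayed(toy_fold), range(4)))

if __name__ == '__main__':
    start = time.time()
    # Outer level: one worker per GPU
    res = Parallel(n_jobs=2, backend='multiprocessing')(
        map(delayed(toy_gpu_worker), range(2)))
    print(res, 'elapsed:', time.time() - start)
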
How can I keep the folds training in parallel within each GPU while also training different models on multiple GPUs at the same time?