Model fits on a single GPU, but the script crashes when trying to fit on multiple GPUs

Time: 2018-10-09 14:56:58

Tags: python tensorflow keras

I have a model that trains fine on a single GPU, but when I try to fit it using multi_gpu_model, I get the following CUDA error before the script aborts:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:521] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

I have tried passing both compiled and uncompiled versions of the model instance to the multi_gpu_model function, but it doesn't change anything. I call it like this:

from keras.utils import multi_gpu_model

multi_model = multi_gpu_model(model, gpus=4)

Compilation is done like this and does not raise any errors:

multi_model.compile(
    optimizer=keras.optimizers.Adam(5e-4),
    loss=dice_coefficient_loss,
    metrics=[dice_coefficient]
            + get_label_wise_dice_coefficient_functions(n_labels))

import functools

from keras import backend as K


def dice_coefficient(y_true, y_pred, smooth=1.):
    # Sorensen-Dice coefficient with additive smoothing, computed over
    # the flattened tensors.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return ((2. * intersection + smooth)
            / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth))


def dice_coefficient_loss(y_true, y_pred):
    return -dice_coefficient(y_true, y_pred)


def label_wise_dice_coefficient(y_true, y_pred, label_index):
    # Dice coefficient restricted to a single label channel.
    return dice_coefficient(y_true[:, label_index], y_pred[:, label_index])


def get_label_dice_coefficient_function(label_index):
    # Bind the label index and give the partial a __name__ so that Keras
    # can report the metric under a readable name.
    f = functools.partial(label_wise_dice_coefficient, label_index=label_index)
    f.__setattr__('__name__', 'label_{0}_dice_coef'.format(label_index))
    return f


def get_label_wise_dice_coefficient_functions(n_labels):
    return [get_label_dice_coefficient_function(i) for i in range(n_labels)]

(Most of these functions and the model architecture were stolen from here.)

I am using python 3.6.6, tensorflow-gpu 1.10.0, cudatoolkit 9.2 and cudnn 7.2.1 from the conda main repository, and keras-contrib 2.0.8 installed via pip/git, on 64-bit CentOS 7.4.1708.

Looking at the preceding log lines, it seems that the multiple GPUs are detected correctly:

2018-10-09 16:30:19.977993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:20:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.318137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:21:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.595428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:22:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.953619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:23:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.967429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-10-09 16:30:22.415906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-09 16:30:22.415957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2018-10-09 16:30:22.415965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y Y
2018-10-09 16:30:22.415971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y Y
2018-10-09 16:30:22.415982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N Y
2018-10-09 16:30:22.415988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   Y Y Y N
2018-10-09 16:30:22.416681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10393 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:20:00.0, compute capability: 6.1)
2018-10-09 16:30:22.536003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10393 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:21:00.0, compute capability: 6.1)
2018-10-09 16:30:22.637811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10393 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:22:00.0, compute capability: 6.1)
2018-10-09 16:30:22.747698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10393 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:23:00.0, compute capability: 6.1)
2018-10-09 16:30:25,557.557:__main__:INFO:Compiling model
2018-10-09 16:30:25,634.634:__main__:INFO:Fitting model
2018-10-09 16:31:31.773355: F tensorflow/stream_executor/cuda/cuda_dnn.cc:521] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
/bin/bash: line 1: 160691 Aborted

Any help figuring out what I am doing wrong would be greatly appreciated.

1 Answer:

Answer 0 (score: 1)

It turns out that the .fit() method of the model returned by multi_gpu_model doesn't like it when the number of samples in the dataset is not a multiple of the batch size, as was the case for me. Removing samples from the dataset solved my problem. I reported this bug here.
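
As a minimal sketch of the workaround (x_train, y_train and batch_size are hypothetical stand-ins for the actual data and settings, not names from the question), trimming the dataset so that every batch is full looks like this:

# Hypothetical names: x_train/y_train stand in for the real dataset and
# batch_size for the value passed to fit().
batch_size = 4  # should also be divisible by the number of GPUs

# Drop the trailing samples so the sample count is an exact multiple of
# the batch size; every GPU then always receives a non-empty sub-batch.
n_usable = (len(x_train) // batch_size) * batch_size
x_train = x_train[:n_usable]
y_train = y_train[:n_usable]

multi_model.fit(x_train, y_train, batch_size=batch_size)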