Question

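I'm having trouble classifying objects with some pre-trained models. The code runs with ResNet and Inception, but when I use VGG16 or VGG19 I run into a cuDNN problem.

I'm running the code in a conda virtual environment with tensorflow-gpu = 2.2.0, cuda = 10.1, cudnn = 7.6.5.

The cuDNN installed on my operating system is 8.0.4. Could this be the problem? Many other models work fine on this system, but not in this case.

Here is my code: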

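# NOTE: the import lines were omitted from the original post;
# the ones below are assumed from the names used in the script.
import argparse

import numpy as np
from tensorflow.keras.applications import (
    VGG16, VGG19, InceptionV3, ResNet50, Xception, imagenet_utils)
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Parse the command line arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to the input image")
ap.add_argument("-model", "--model", type=str, default="vgg16",
    help="name of pre-trained network to use")
args = vars(ap.parse_args())

# Map model names to their Keras classes
MODELS = {
    "vgg16": VGG16,
    "vgg19": VGG19,
    "inception": InceptionV3,
    "xception": Xception, # TensorFlow ONLY
    "resnet": ResNet50
}

if args["model"] not in MODELS.keys():
    raise AssertionError("The --model command line argument should "
        "be a key in the `MODELS` dictionary")

# VGG16, VGG19 and ResNet50 take 224x224 inputs
inputShape = (224, 224)
preprocess = imagenet_utils.preprocess_input

# InceptionV3 and Xception take 299x299 inputs and use different preprocessing
if args["model"] in ("inception", "xception"):
    inputShape = (299, 299)
    preprocess = preprocess_input

# Load the network with ImageNet weights
Network = MODELS[args["model"]]
model = Network(weights="imagenet")
#model = Network()
model.summary()

# Load the image and resize it to the expected input size
image = load_img(args["image"], target_size=inputShape)
image = img_to_array(image)

# Add the batch dimension and preprocess
image = np.expand_dims(image, axis=0)
image = preprocess(image)

# Classify the image and decode the top predictions
preds = model.predict(image)
P = imagenet_utils.decode_predictions(preds)

for (i, (imagenetID, label, prob)) in enumerate(P[0]):
    print("{}. {}: {:.2f}%".format(i + 1, label, prob * 100))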

Here is the log output:

2020-11-08 11:14:31.324751: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-11-08 11:14:31.334392: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "Classify_keras_applications.py", line 92, in <module>
    preds = model.predict(image)
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 88, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1268, in predict
    tmp_batch_outputs = predict_function(iterator)
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 650, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/phat/anaconda3/envs/DL/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node vgg19/block1_conv1/Conv2D (defined at Classify_keras_applications.py:92) ]] [Op:__inference_predict_function_763]

Function call stack:
predict_function

Answer 1

Have you checked this issue: https://github.com/tensorflow/tensorflow/issues/34888

They suggest adding this code at the top of your script:

import tensorflow as tf

# Must run before any model is built or any GPU op is executed
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

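This does not allocate all of the GPU's memory up front; the allocation grows as the model needs more. However, I'd bet that VGGx does not fit in your GPU memory, and I don't think it will fit even with this extra code.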
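A slightly more defensive variant of that setup could look like the sketch below. The helper name `enable_gpu_memory_growth` is made up for illustration, and the import guard is only there so the script still starts on a machine without TensorFlow. It uses the same `tf.config.experimental` calls as the snippet in the linked issue, which raise `RuntimeError` if the GPU has already been initialized.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once.

    Hypothetical helper, not part of TensorFlow. Returns the number of GPUs
    that were configured (0 if TensorFlow or a GPU is not available).
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment; nothing to configure.
        return 0

    gpus = tf.config.experimental.list_physical_devices("GPU")
    configured = 0
    for gpu in gpus:
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPU was already initialized, i.e. this function
            # was called too late (after a model or tensor was created).
            print("Could not enable memory growth for {}: {}".format(gpu.name, err))
    return configured


# Call this before building any model, e.g. before Network(weights="imagenet").
num_configured = enable_gpu_memory_growth()
print("GPUs with memory growth enabled:", num_configured)
```

If this prints 0 on a machine that does have a GPU, TensorFlow is not seeing the device at all, which would point to a driver or CUDA setup problem rather than a memory one.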

For reference, check the following doc:

VGG16: 528 MB
VGG19: 549 MB

And:

ResNet50: 98 MB
InceptionV3: 92 MB

The VGG models are more than 5 times the size of the others.
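To put numbers on that comparison, here is a small sketch that computes the size ratios and a rough parameter count from the figures quoted above. It assumes the weights are stored as 32-bit floats and that "MB" means 2**20 bytes; both are assumptions, not something the figures state. Keep in mind that the weight file is only a lower bound on GPU memory use, since activations and cuDNN workspace come on top.

```python
# Weight-file sizes in MB, as quoted in this answer.
SIZES_MB = {"VGG16": 528, "VGG19": 549, "ResNet50": 98, "InceptionV3": 92}

BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def approx_params_millions(size_mb):
    """Rough parameter count in millions, treating 1 MB as 2**20 bytes."""
    return size_mb * 2**20 / BYTES_PER_FLOAT32 / 1e6


for name, size in SIZES_MB.items():
    ratio = size / SIZES_MB["ResNet50"]
    print("{:12s} {:4d} MB  ~{:5.1f}M params  {:.1f}x ResNet50".format(
        name, size, approx_params_millions(size), ratio))
```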

Pre-trained models run fine with ResNet and InceptionNet, but not with VGG16 and VGG19
