Cannot use the GPU with tensorflow-gpu: "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR"

Time: 2019-06-29 14:54:25

Tags: python tensorflow gpu cudnn

Summary of my problem

When I run code with tensorflow-gpu, I get the error in the title. It happens with every script that contains a convolutional layer.

Environment

  • Ubuntu 18.04
  • Python 3.7.1
  • tensorflow-gpu 1.13.1
  • CUDA 10.1
  • CuDNN 7.4.2

GPU details

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    21W / 215W |    568MiB /  7949MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1733      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1771      G   /usr/bin/gnome-shell                          57MiB |
|    0      2698      G   /usr/lib/xorg/Xorg                           175MiB |
|    0      2813      G   /usr/bin/gnome-shell                         168MiB |
|    0      3339      G   ...uest-channel-token=11703333986562712743    76MiB |
|    0      8579      G   /proc/self/exe                                67MiB |
+-----------------------------------------------------------------------------+

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/bin
CUDA_PATH=/usr/local/cuda-10.0
LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/lib64
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH    "
export PATH="/usr/local/cuda/bin:$PATH"
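The Environment list above names CUDA 10.1, while the variables shown here point at /usr/local/cuda-10.0. Note that the "CUDA Version" printed by nvidia-smi is the highest version the driver supports, not necessarily the installed toolkit. A minimal sketch, using only the standard library, of how one might list the CUDA versions that such path variables reference; the helper name cuda_versions_in is hypothetical and the strings are copied from the question. It only reports what the paths say and does not establish the cause of the error.

```python
import re


def cuda_versions_in(path_value):
    """Return the set of CUDA versions named in a colon-separated path string."""
    versions = set()
    for entry in path_value.split(":"):
        # Matches directory names such as /usr/local/cuda-10.0/bin
        match = re.search(r"cuda-(\d+\.\d+)", entry)
        if match:
            versions.add(match.group(1))
    return versions


# Values copied from the question (shortened to the CUDA-related entries).
path = "/usr/local/sbin:/usr/local/bin:/usr/local/cuda-10.0/bin"
ld_library_path = "/usr/local/sbin:/usr/local/bin:/usr/local/cuda-10.0/lib64"
listed_version = "10.1"  # from the Environment list

found = cuda_versions_in(path) | cuda_versions_in(ld_library_path)
print("CUDA versions referenced by the paths:", sorted(found))
print("Matches the listed version:", listed_version in found)
```

Unversioned entries such as /usr/local/cuda/lib64 are ignored by this sketch, because that directory is usually a symlink whose target would have to be resolved on the machine itself.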

Full error message

2019-06-29 23:13:22.132275: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-29 23:13:22.803064: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-29 23:13:22.805965: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 90, in <module>
    main(args)
  File "train.py", line 81, in main
    callbacks=[callback]
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node block1_conv1/Conv2D}}]]
     [[{{node loss/arc_face_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]

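The log says "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR", so I suspect cuDNN is the cause. I tried approaches such as running sudo rm -rf ~/.nv/ (suggested in another question) and setting config.gpu_options.allow_growth = True (suggested in a GitHub issue), but neither solved the problem.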
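For readers unfamiliar with the allow_growth setting mentioned above, here is a minimal sketch of how it is typically applied in TensorFlow 1.x. The function name make_growth_config is hypothetical, and the import guard is only there so the snippet can be read and run on a machine without TensorFlow. This is not a confirmed fix; the question states that the setting alone did not help.

```python
try:
    import tensorflow as tf
except ImportError:  # allows the sketch to run where TensorFlow is not installed
    tf = None


def make_growth_config():
    """Build a session config that grows GPU memory on demand (None without TensorFlow)."""
    if tf is None:
        return None
    # tf.ConfigProto in TensorFlow 1.x; tf.compat.v1.ConfigProto in TensorFlow 2.x.
    proto_cls = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto
    config = proto_cls()
    config.gpu_options.allow_growth = True
    return config


config = make_growth_config()
if config is not None:
    print("allow_growth =", config.gpu_options.allow_growth)
    # The config only takes effect once a session is created with it,
    # before the model is built, for example in TensorFlow 1.x:
    #   tf.keras.backend.set_session(tf.Session(config=config))
```

The setting has to reach the session that Keras actually uses; creating the config without passing it to a session changes nothing.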

Please tell me how to solve this problem.

Thanks.

1 answer:

Answer 0 (score: -1)

Try this code:

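import tensorflow as tf
# TensorFlow 2.x API: grow GPU memory on demand instead of reserving it all at startup.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)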

It worked for me.
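The snippet in this answer indexes the first GPU directly, which fails with an IndexError when no GPU is visible. A slightly more defensive sketch, assuming a TensorFlow version that provides tf.config.experimental (it is part of TensorFlow 2.x and may be missing from the 1.13.1 used in the question); the function name enable_memory_growth_on_all_gpus is hypothetical.

```python
try:
    import tensorflow as tf
except ImportError:  # allows the sketch to run where TensorFlow is not installed
    tf = None


def enable_memory_growth_on_all_gpus():
    """Turn on memory growth for every visible GPU and return how many were configured.

    Returns None when TensorFlow or its tf.config.experimental API is unavailable.
    """
    experimental = getattr(getattr(tf, "config", None), "experimental", None)
    if experimental is None:
        return None
    gpus = experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPUs are initialized; otherwise TensorFlow raises RuntimeError.
        experimental.set_memory_growth(gpu, True)
    return len(gpus)


configured = enable_memory_growth_on_all_gpus()
print("GPUs configured for memory growth:", configured)
```

Call the function at the very start of the program, before any model or tensor is created.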