When I run code with tensorflow-gpu, I get the error in the title. It happens with every piece of code that contains a convolutional layer.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    21W / 215W |    568MiB /  7949MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1733      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1771      G   /usr/bin/gnome-shell                          57MiB |
|    0      2698      G   /usr/lib/xorg/Xorg                           175MiB |
|    0      2813      G   /usr/bin/gnome-shell                         168MiB |
|    0      3339      G   ...uest-channel-token=11703333986562712743    76MiB |
|    0      8579      G   /proc/self/exe                                67MiB |
+-----------------------------------------------------------------------------+
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/bin
CUDA_PATH=/usr/local/cuda-10.0
LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/lib64
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH "
export PATH="/usr/local/cuda/bin:$PATH"
2019-06-29 23:13:22.132275: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-29 23:13:22.803064: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-29 23:13:22.805965: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 90, in <module>
    main(args)
  File "train.py", line 81, in main
    callbacks=[callback]
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins) # pylint: disable=not-callable
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node block1_conv1/Conv2D}}]]
     [[{{node loss/arc_face_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]
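The log says "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR", so I suspect cuDNN is the cause. I tried fixes such as sudo rm -rf ~/.nv/ from this question and config.gpu_options.allow_growth = True from this GitHub issue, but neither solved it.
How can I fix this problem?
Thanks.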
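For reference, the allow_growth setting mentioned above is usually applied through a session config in TensorFlow 1.x. The following is only a minimal sketch, not the asker's actual code: it assumes a TensorFlow build that exposes tf.compat.v1, and the helper names make_growth_config and install_keras_session are hypothetical. Registering the session with Keras makes sure training runs in the session that carries this option.

```python
import tensorflow as tf


def make_growth_config():
    """Build a session config that allocates GPU memory on demand."""
    config = tf.compat.v1.ConfigProto()
    # Grow the GPU memory pool as needed instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    return config


def install_keras_session(config):
    """Create a session from `config` and make it the Keras session.

    Assumption: tf.compat.v1.keras.backend.set_session is available in the
    installed TensorFlow version.
    """
    session = tf.compat.v1.Session(config=config)
    tf.compat.v1.keras.backend.set_session(session)
    return session
```

A script such as train.py would call install_keras_session(make_growth_config()) once, before any model or layer is created.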
Answer 0 (score: -1)
Try this code:
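import tensorflow as tf
# Let TensorFlow grow GPU memory on demand instead of reserving it all at startup.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)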
It worked for me.
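Note that the snippet in this answer indexes physical_devices[0], which raises an IndexError when TensorFlow sees no GPU, and it only configures the first device. A slightly more defensive version might look like the sketch below. It assumes a TensorFlow release that ships tf.config.experimental, which may not be true of the version shown in the question's log, and the function name enable_memory_growth is hypothetical.

```python
import tensorflow as tf


def enable_memory_growth():
    """Enable memory growth on every visible GPU.

    Returns the list of devices that were configured; the list is empty
    when TensorFlow does not see any GPU.
    """
    configured = []
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu)
        except RuntimeError as err:
            # Raised when the device was already initialized; memory growth
            # can only be changed before the GPU is first used.
            print("Could not set memory growth for %s: %s" % (gpu.name, err))
    return configured


if __name__ == "__main__":
    print("Memory growth enabled on:", enable_memory_growth())
```

Call it at the very top of the training script, before any model, layer, or tensor is created.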