tensorflow-gpu 2.0中的CUDNN_STATUS_INTERNAL_ERROR

时间:2019-10-16 08:05:09

标签: tensorflow nvidia cudnn

在Tensorflow 2.0中运行CNN时,出现CUDNN_STATUS_INTERNAL_ERROR。 看来libcublas.so.10.0和libcudnn.so.7加载正常。

版本应该没问题:

  • Tensorflow 2.0
  • ubuntu 18.04
  • GeForce GTX 1650
  • NVIDIA驱动程序430
  • cudnn:7.4.2.24(也尝试使用7.3.0.29和7.6.4.38) (ref

我尝试了以下方法,但他们没有解决问题:

  1. 我删除了〜/ .nv(ref
  2. 将/usr/include/cudnn.h #include "driver_types.h"修改为#include <driver_types.h>,并通过mnistCUDNN测试(ref

问题:

  1. 通过mnistCUDNN测试是否意味着正确安装了必需的软件包?
  2. 如何在下面解决此问题?

毕竟,这是错误消息:

Using TensorFlow backend.
2019-10-16 14:48:16.226892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44]  Successfully opened dynamic library libcuda.so.1
2019-10-16 14:48:16.255123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006]    successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
2019-10-16 14:48:16.370703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
Train on 48000 samples, validate on 12000 samples
Epoch 1/12
2019-10-16 14:48:17.357747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-16 14:48:17.525865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
--error here--
2019-10-16 14:48:17.873127: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-16 14:48:17.879412: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
--error here--
2019-10-16 14:48:17.879516: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
 [[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
File "lenet.py", line 96, in <module> x_train, y_train, batch_size=128, epochs=12, validation_split=0.2
File "lenet.py", line 83, in train verbose=self.verbose
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit validation_freq=validation_freq)
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop outs = fit_function(ins_batch)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3740, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
return self._call_impl(args, kwargs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
 [[node conv2d_1/convolution (defined at /home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_1220]

Function call stack:
keras_scratch_graph

1 个答案:

答案 0 :(得分:0)

我在Ubuntu 20.04 / RTX 2070系统上遇到此错误。我发现了:

https://gist.github.com/mikaelhg/cae5b7938aa3dfdf3d06a40739f2f3f4#file-cuda-install-md

建议导出如下环境变量:

export TF_FORCE_GPU_ALLOW_GROWTH=true

那为我解决了。快乐的日子。