Getting "Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED" on a previously working system

Time: 2019-11-30 03:20:43

Tags: python tensorflow

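Everything was working a week ago. The code runs on a server, and as far as I know nothing significant has changed there, so I don't know what is causing this. The TensorFlow version is 2.1.0-dev20191015.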

In any case, here is the GPU status:

NVIDIA-SMI 430.50
Driver Version: 430.50
CUDA Version: 10.1

Epoch 1/5
2019-11-29 22:08:00.334979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 22:08:00.644569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 22:08:00.647191: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-11-29 22:08:00.647309: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 430.50.0
2019-11-29 22:08:00.647347: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2019-11-29 22:08:00.647393: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Fail to find the dnn implementation.

In the end, I get:

UnknownError: [_Derived_] Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
[[sequential/bidirectional/forward_lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_18158]
Function call stack:
distributed_function -> distributed_function -> distributed_function

The traceback points to this call:

history = model.fit(training_input, training_output, epochs=EPOCHES,
                    batch_size=BATCH_SIZE,
                    validation_split=0.1)

Thanks.

1 Answer:

Answer 0: (score: 1)

It turned out there had been a system-wide upgrade after all. Updating CUDA to 10.2, nvidia-driver to 440, and libcudnn7 to 7.6.5 resolved the problem.
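
For anyone hitting the same error after an unnoticed upgrade, it may help to record which driver and TensorFlow versions are actually in use before and after changing anything. The following is only a rough sketch, not part of the original fix: the helper name collect_gpu_stack_versions is made up for illustration, it assumes nvidia-smi is on the PATH, and it assumes a TensorFlow build that provides tf.config.list_physical_devices (2.1 or later). Anything it cannot determine is left as None.

```python
import shutil
import subprocess


def collect_gpu_stack_versions():
    """Return the driver and TensorFlow versions that can be detected (hypothetical helper)."""
    info = {"driver": None, "tensorflow": None, "gpus": None}

    # Ask nvidia-smi for the installed driver version, if the tool is available.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode == 0:
            lines = result.stdout.strip().splitlines()
            info["driver"] = lines[0].strip() if lines else None

    # TensorFlow is treated as optional so the script still runs without it.
    try:
        import tensorflow as tf
    except ImportError:
        return info

    info["tensorflow"] = tf.version.VERSION
    # Assumption: this API is available (TensorFlow 2.1 or later).
    info["gpus"] = [device.name for device in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_stack_versions().items():
        print(f"{key}: {value}")
```

Comparing this output with the library names in the TensorFlow log (for example libcublas.so.10.0 and libcudnn.so.7 in the question) can hint at which component was changed by the upgrade.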