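Everything was fine a week ago.
Even though I run this on a server, I don't think much has changed.
I don't know what caused this.
The TensorFlow version is 2.1.0-dev20191015.
Anyway, here is the GPU status: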
NVIDIA-SMI 430.50
Driver Version: 430.50
CUDA Version: 10.1
Epoch 1/5
2019-11-29 22:08:00.334979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 22:08:00.644569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 22:08:00.647191: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-11-29 22:08:00.647309: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 430.50.0
2019-11-29 22:08:00.647347: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2019-11-29 22:08:00.647393: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Fail to find the dnn implementation.
Finally, I get:
UnknownError: [_Derived_] Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
[[sequential/bidirectional/forward_lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_18158]
Function call stack:
distributed_function -> distributed_function -> distributed_function
The traceback points to this call:
174 history = model.fit(training_input, training_output, epochs=EPOCHES,
175 batch_size=BATCH_SIZE,
--> 176 validation_split=0.1)
Thanks.
Answer 0 (score: 1)
There was indeed a system-wide upgrade. Updating CUDA to 10.2, nvidia-driver to 440, and libcudnn7 to 7.6.5 solved the problem.
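If it helps anyone comparing their setup before and after such an upgrade, here is a minimal sketch that pulls the relevant facts out of the TensorFlow log: which shared libraries were opened, which cuDNN status was reported, and which driver version the runtime saw. It uses only the Python standard library. The names `LOG` and `summarize_gpu_log` are made up for this illustration and are not part of TensorFlow; the log text is copied from the question.

```python
import re

# Log excerpt copied from the question. LOG is a hypothetical name used
# only for this illustration.
LOG = """\
2019-11-29 22:08:00.334979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 22:08:00.644569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 22:08:00.647191: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-11-29 22:08:00.647309: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 430.50.0
"""


def summarize_gpu_log(log_text):
    """Extract loaded libraries, cuDNN status codes, and the driver version.

    Hypothetical helper: it only pattern-matches the message formats that
    appear in the log above and makes no claim about other TensorFlow builds.
    """
    libraries = re.findall(r"Successfully opened dynamic library (\S+)", log_text)
    cudnn_status = re.findall(r"Could not create cudnn handle: (\w+)", log_text)
    driver = re.search(r"Possibly insufficient driver version: ([\d.]+)", log_text)
    return {
        "libraries": libraries,
        "cudnn_status": cudnn_status,
        "driver_version": driver.group(1) if driver else None,
    }


if __name__ == "__main__":
    for key, value in summarize_gpu_log(LOG).items():
        print(f"{key}: {value}")
```

Running the same summary on a log captured after the upgrade makes it easy to see which library versions or driver version changed. Note that this only reads what the log already says; it does not query the GPU or the installed packages.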