Question

I want to train a TensorFlow model on my GPU.

I am using:

tensorboard                        2.4.1
tensorboard-plugin-wit             1.8.0
tensorflow-estimator               2.4.0
tensorflow-gpu                     2.4.1
cuda                               11.0
cudnn                              8.0.4
gpu                                RTX 3060 Laptop 6 GB
Nvidia FrameView SDK               1.1.4923.29548709
Nvidia Graphics Drivers            461.72
Nvidia PhysX                       9.19.0218
Python                             3.8.5
IDE                                Spyder 4.2.1
OS                                 Windows 10 LTSC-2019 (modified)

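What did I do before posting this request for help?

1/ I installed the Nvidia graphics drivers.

2/ I followed this TensorFlow tutorial: https://www.tensorflow.org/install/gpu

So I copied the cuda folder from the downloaded cuDNN archive into C:\tools\

I also added all the required directories to the PATH environment variable.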
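One way to double-check the PATH step is to search each PATH directory for the libraries TensorFlow loads at startup. The sketch below is only an illustration: the DLL names `cudart64_110.dll` and `cudnn64_8.dll` are assumptions based on CUDA 11.0 and cuDNN 8 on Windows, and `find_on_path` and `report` are hypothetical helper names, not part of TensorFlow.

```python
import os
from pathlib import Path

# Assumed DLL names for CUDA 11.0 and cuDNN 8 on Windows; adjust to your install.
REQUIRED_DLLS = ["cudart64_110.dll", "cudnn64_8.dll"]


def find_on_path(filename, path_value=None):
    """Return every directory listed in PATH that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        # Skip empty entries produced by stray separators.
        if entry and (Path(entry) / filename).is_file():
            hits.append(entry)
    return hits


def report(dll_names=REQUIRED_DLLS, path_value=None):
    """Map each DLL name to the PATH directories where it was found."""
    return {name: find_on_path(name, path_value) for name in dll_names}


if __name__ == "__main__":
    for name, dirs in report().items():
        print(name, "->", dirs if dirs else "NOT FOUND on PATH")
```

An empty list for a DLL suggests that its folder is missing from PATH. More than one hit suggests several installs that may conflict with each other.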

3/ I tried to train my model (everything works fine if I use the CPU):

with tf.device("/GPU:0"):
    history = model.fit(images, imagesID, epochs=50, validation_split=0.2)

Error:

2021-03-14 15:07:16.145096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-03-14 15:07:16.145335: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-03-14 15:07:16.146411: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-03-14 15:07:16.146595: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-03-14 15:07:16.146845: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

So I found this on the internet: https://github.com/tensorflow/tensorflow/issues/45779

Following that, I added this code at the top of my script to limit GPU memory usage:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)

Error:

Physical devices cannot be modified after being initialized
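This message means the memory-growth setting was applied after TensorFlow had already initialized the GPU. In an IDE such as Spyder this can happen because the kernel keeps the TensorFlow runtime alive between runs. A minimal sketch of the ordering, assuming a fresh process (restart the kernel first); `enable_memory_growth` is a hypothetical helper name, while `TF_FORCE_GPU_ALLOW_GROWTH` is the environment variable described in the TensorFlow GPU guide:

```python
import os

# Set this before TensorFlow initializes the GPU; doing it before the
# import is the safe choice. Assumes a fresh Python process.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Turn on memory growth for every visible GPU; return how many were configured."""
    import tensorflow as tf  # imported here so the variable above is already set

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPUs have already been initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call `enable_memory_growth()` once, before building the model or running any other TensorFlow operation.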

So I found this: https://github.com/tensorflow/tensorflow/issues/25138

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

But I still get the same error:

2021-03-14 15:07:16.145096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
...

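I'm completely lost, because I don't know enough about Tensorflow-GPU errors...

The full logs are here: https://pastebin.com/Xtsv3mLe

I'm not very good at writing posts, so I hope I was clear enough.

Thanks in advance!!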

Answer 1

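You need CUDA 11.0, not 11.1. You can find more information about what you need at https://www.tensorflow.org/install/gpu. This walkthrough may be more helpful than the installation guide (though you should still read the guide): https://alejandro-gc.medium.com/setting-up-your-gpu-for-tensorflow-2-4-2021-d98cac79a686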

Answer 2

Your GPU is an RTX 3060. Maybe you could try CUDA 11.1 with cuDNN 8.0.5.

I also use an RTX 3060, in my desktop, and it worked.
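Because the suggestions above name different CUDA versions, it may help to see which versions the installed TensorFlow wheel was built against. The sketch below relies on `tf.sysconfig.get_build_info()`. The key names `cuda_version` and `cudnn_version` are assumptions about what that dictionary contains on a GPU build, and `describe_build` and `tensorflow_build_summary` are hypothetical helper names.

```python
def describe_build(build_info):
    """Format the CUDA and cuDNN versions found in a build-info mapping."""
    # Assumed key names; fall back to "unknown" if a key is absent.
    cuda = build_info.get("cuda_version", "unknown")
    cudnn = build_info.get("cudnn_version", "unknown")
    return f"built against CUDA {cuda}, cuDNN {cudnn}"


def tensorflow_build_summary():
    """Summarize the installed TensorFlow version and its build configuration."""
    import tensorflow as tf  # imported lazily so describe_build works without it

    info = tf.sysconfig.get_build_info()
    return f"TensorFlow {tf.__version__} " + describe_build(info)
```

Running `print(tensorflow_build_summary())` in the same environment as the training script shows what the wheel expects, and that can be compared with the toolkit actually installed.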

GPU issue with TensorFlow 2.4.1
