Question

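I have installed TensorFlow, cuDNN, CUDA, and everything else needed to run TensorFlow on the GPU. When the code runs, it exits during the first epoch with error code 3221226505. I don't know why this happens; the same program runs fine on Google Colab and only uses about 1 GB of GPU memory.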

Python - 3.8.6, CUDA - 11.2.1, cuDNN - 11.2

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pandas as pd 
import time
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

data = pd.read_csv("hmnist_28_28_RGB.csv")

X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]

X = X / 255.0
X = X.values.reshape(-1,28,28,3)

y = tf.keras.utils.to_categorical(y.values,7)

print(y.shape)
print(X.shape)

model = Sequential()
model.add(Conv2D(256, (3, 3), input_shape=X.shape[1:]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(256, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(7))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X, y, batch_size=5, epochs=100, validation_split=0.3,verbose=1)

2021-02-17 16:16:20.046679: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-17 16:16:20.047795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-17 16:16:20.073694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3060 Ti computeCapability: 8.6
coreClock: 1.665GHz coreCount: 38 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-02-17 16:16:20.074002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-17 16:16:20.090128: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:20.090262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:20.094202: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-17 16:16:20.095533: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-17 16:16:20.100795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-17 16:16:20.104097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-17 16:16:20.104808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-17 16:16:20.105102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-17 16:16:20.105825: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-17 16:16:20.107415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3060 Ti computeCapability: 8.6
coreClock: 1.665GHz coreCount: 38 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-02-17 16:16:20.107754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-17 16:16:20.107967: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:20.108188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:20.108410: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-17 16:16:20.108609: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-17 16:16:20.108798: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-17 16:16:20.108900: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-17 16:16:20.109119: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-17 16:16:20.109366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-17 16:16:20.583958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-17 16:16:20.584082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-02-17 16:16:20.584220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-02-17 16:16:20.584499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1024 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Ti, pci bus id: 0000:09:00.0, compute capability: 8.6)
2021-02-17 16:16:20.585568: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
1 Physical GPUs, 1 Logical GPUs
(10015, 7)
(10015, 28, 28, 3)
2021-02-17 16:16:23.006535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/100
2021-02-17 16:16:23.409332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:24.173684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:24.177694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll

[Done] exited with code=3221226505 in 7.113 seconds

Answer 1

I had the same problem for several days and finally managed to fix it. It seems to be a .dll problem.

First, I ran the program in Windows PowerShell, like this (in my case):

python -u "D:\Documents\Python\test.py"

I noticed that after the last three "Successfully opened dynamic library..." lines, a new error appeared, stating that it could not load a library (cudnn_adv_train64_8.dll specifically), with Error 126.
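If you would rather capture that output from Python than from a shell, here is a minimal sketch that runs the script as a child process and keeps the last lines of stderr, which is where the load error showed up for me. The helper names run_and_capture and format_exit_code are made up for this example. The second helper just prints the exit code in hex: 3221226505 is 0xC0000409, which Windows lists as STATUS_STACK_BUFFER_OVERRUN, a generic abort code that says little about the cause on its own.

```python
import subprocess
import sys


def run_and_capture(script_path, tail=5):
    """Run a script with unbuffered output; return (exit code, last stderr lines)."""
    result = subprocess.run(
        [sys.executable, "-u", script_path],  # same "-u" flag as the manual command
        capture_output=True,
        text=True,
    )
    # The error that an editor's output panel may hide usually sits at the end of stderr.
    stderr_tail = result.stderr.splitlines()[-tail:]
    return result.returncode, stderr_tail


def format_exit_code(code):
    """Format an exit code as unsigned 32-bit hex, the way Windows status codes are listed."""
    return f"0x{code & 0xFFFFFFFF:08X}"
```

Calling run_and_capture with the path to your training script and printing the returned lines should show the same library error that PowerShell displays, assuming the crash happens the same way in a child process.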

Then I simply copied all the files from the cuDNN bin folder:

cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll
cudnn64_8.dll

and pasted them into the CUDA bin folder of my installation.
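The copy step can also be scripted. This is only a sketch under some assumptions: the function name copy_missing_dlls is hypothetical, the two folder arguments are whatever your own cuDNN and CUDA bin folders are, and the file list is the set of cuDNN 8 DLL names from my setup. It skips files that already exist in the destination so nothing gets overwritten.

```python
import shutil
from pathlib import Path

# cuDNN 8 DLL names as they appeared in my cuDNN bin folder (adjust for your version).
CUDNN_DLLS = [
    "cudnn_adv_infer64_8.dll",
    "cudnn_adv_train64_8.dll",
    "cudnn_cnn_infer64_8.dll",
    "cudnn_cnn_train64_8.dll",
    "cudnn_ops_infer64_8.dll",
    "cudnn_ops_train64_8.dll",
    "cudnn64_8.dll",
]


def copy_missing_dlls(cudnn_bin, cuda_bin, names=CUDNN_DLLS):
    """Copy each named DLL from cudnn_bin into cuda_bin unless it is already there.

    Returns the list of file names that were actually copied.
    """
    src_dir, dst_dir = Path(cudnn_bin), Path(cuda_bin)
    copied = []
    for name in names:
        src, dst = src_dir / name, dst_dir / name
        if dst.exists():
            continue  # already present, leave it untouched
        if not src.is_file():
            raise FileNotFoundError(f"{name} not found in {src_dir}")
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

If the CUDA folder lives under a protected location, the script will probably need to be run from an elevated prompt, just like a manual copy would.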

Now it seems to run fine, and it is definitely using my GPU, because the training time dropped from 2-3 minutes to 20 seconds.

Hope it works for you too.
