Question

我正在尝试使用Tensorflow-GPU 2.0版本运行一个简单的Tensorflow代码和回调函数。它将gpu抛出内存不足错误。请指教。

import tensorflow as tf

class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs={}):
    if(logs.get('acc')>0.6):
      print("\nReached 60% accuracy so cancelling training!")
      self.model.stop_training = True

mnist = tf.keras.datasets.fashion_mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

callbacks = myCallback()

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[callbacks])

错误信息低于-

name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2019-10-29 18:31:36.464071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-10-29 18:31:36.464082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-29 18:31:36.464092: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-10-29 18:31:36.464102: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-10-29 18:31:36.464111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-10-29 18:31:36.464120: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-10-29 18:31:36.464130: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-29 18:31:36.464163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-29 18:31:36.464344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-29 18:31:36.464496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-10-29 18:31:36.464524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-10-29 18:31:36.465091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-29 18:31:36.465100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-10-29 18:31:36.465104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-10-29 18:31:36.465165: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-29 18:31:36.465352: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-29 18:31:36.465525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 68 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-10-29 18:31:36.467240: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 68.31M (71630848 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-10-29 18:31:36.467629: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 61.48M (64467968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-10-29 18:31:36.469522: F ./tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory

Answer 1

尝试减少batch_size和：

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.7
session = tf.Session(config=config)

也将512替换为128：

tf.keras.layers.Dense(128, activation=tf.nn.relu)

Answer 2

重新启动整个系统。

Tensorflow GPU 2.0的NVIDIA RTX GPU卡内存不足

2 个答案: