Running the training script causes a memory error on a GPU-optimized Ubuntu machine. The error looks suspicious, because the machine appears to have plenty of memory available to run the algorithm.
This is the error:
TensorFlow: Ran out of memory trying to allocate 16.0KiB
Memory state:
$ free -m
             total        used        free      shared  buff/cache   available
Mem:         15038         190        6580           8        8267       14670
Swap:            0           0           0
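For reference, free -m only reports system RAM; the GPU's own memory, which is what TensorFlow actually allocates from, can be inspected separately with:
$ nvidia-smi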
Console output with the error:
$ python ./train.py --run --continue
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Loading data..
Number of categories: 2
Number of samples 425
/home/ubuntu/DeepClassificationBot-master/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
val = np.random.choice(dataset_indx, size=number_of_samples)
/home/ubuntu/DeepClassificationBot-master/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
train = np.random.choice(dataset_indx, size=number_of_samples)
Building and Compiling model..
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Training..
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 16.0KiB was 16.0KiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702580000 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702581b00 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583800 of size 147456
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025a7800 of size 147456
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025cb800 of size 256
....... Very long list of chunks
I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 115 Chunks of size 256 totalling 28.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 34 Chunks of size 512 totalling 17.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 21 Chunks of size 1024 totalling 21.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 2048 totalling 84.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 6912 totalling 47.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 16384 totalling 672.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 32768 totalling 160.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 147456 totalling 1008.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 294912 totalling 1.97MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 589824 totalling 3.94MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 1179648 totalling 7.88MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 14 Chunks of size 2359296 totalling 31.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 4718592 totalling 31.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 9437184 totalling 315.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 67108864 totalling 320.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 6 Chunks of size 411041792 totalling 2.30GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 663988224 totalling 633.23MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.61GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 3878682624
InUse: 3878682624
MaxInUse: 3878682624
NumAllocs: 362
MaxAllocSize: 663988224
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************************************************************************************xxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 16.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:930] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "./train.py", line 154, in <module>
run(extract=extract_mode, cont=continue_)
File "./train.py", line 104, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
self._make_train_function()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates
vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params]
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 78, in variable
get_session().run(v.initializer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized.
[[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'Variable_91/initial_value', defined at:
File "./train.py", line 154, in <module>
run(extract=extract_mode, cont=continue_)
File "./train.py", line 104, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
self._make_train_function()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates
vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params]
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 75, in variable
v = tf.Variable(np.asarray(value, dtype=dtype), name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 211, in __init__
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 289, in _init_from_args
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 628, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 167, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2317, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1239, in __init__
self._traceback = _extract_stack()
Running the deploy script, on the other hand, works, but it still emits memory warnings:
$ python deploy.py --URL http://www.example.com/image.jpg
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
1/1 [==============================] - 2s
______________________________________________
Image Name: image.jpg
Categories:
1. shrek 100.00%
2. darth vader 0.00%
Answer (score 0):
I posted this answer on another question, but I figured I'd post it here for visibility. I'm not sure how large your model is, but in my case I was running a relatively small model and was somewhat surprised that such a small model produced an OOM error.
Specifically, I was getting out-of-memory errors while training a small CNN on a GTX 970. By some fluke, I discovered that telling TensorFlow to allocate memory on the GPU as needed (rather than up front) resolved all my issues. This can be done with the following Python code:
import tensorflow as tf

# Allocate GPU memory on demand rather than reserving most of it up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Previously, TensorFlow would pre-allocate ~90% of GPU memory. For some unknown reason, this would later lead to out-of-memory errors when I increased the size of the network. With the code above, I no longer get OOM errors.
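Since train.py in the question drives TensorFlow through Keras rather than creating a session itself, the configured session also has to be handed to Keras explicitly. A minimal sketch, assuming the Keras TensorFlow backend shown in the traceback above (which provides a set_session helper); run this before building the model:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Configure on-demand GPU memory allocation, then register the session
# with Keras so all subsequent model code runs in it
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

If growing allocation is not desired, an alternative is to cap the up-front reservation instead, e.g. config.gpu_options.per_process_gpu_memory_fraction = 0.5 to claim at most half of the GPU's memory; both options modify the pre-allocation behaviour described above.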