TensorFlow OOM on GPU

Time: 2017-02-27 21:17:06

Tags: tensorflow out-of-memory gpu vram

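I'm training an LSTM-RNN on some music data in TensorFlow and have run into a GPU memory allocation problem that I don't understand: I hit an OOM even though there still seems to be just enough VRAM available. Some background: I'm running Ubuntu GNOME 16.04 with a GTX 1060 6GB, an Intel Xeon E3-1231 v3 and 8 GB of RAM. First, here is the part of the error message that I can make sense of; I'll add the whole error message at the end for anyone who might need it to help: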


I tensorflow/core/common_runtime/] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/] Stats:
Limit:                  5484118016
InUse:                  4670717952
MaxInUse:               5484118016
NumAllocs:                      29
MaxAllocSize:           1612612352

W tensorflow/core/common_runtime/] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/] Ran out of memory trying to allocate 775.72MiB.  See logs for memory state.
W tensorflow/core/framework/] Resource exhausted: OOM when allocating tensor with shape[14525,14000]

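So I can read that at most 5484118016 bytes can be allocated, that 4670717952 bytes are already in use, and that another 775.72 MiB, which I converted to 775720000 bytes, is to be allocated. According to my calculator, 5484118016 bytes - 4670717952 bytes - 775720000 bytes = 37680064 bytes. So after allocating space for the new tensor it wants to push in, there should still be about 37 MB of free VRAM. That also seems quite reasonable to me, since TensorFlow probably (I guess?) wouldn't try to allocate more VRAM than is available, and would just keep the rest of the data in RAM or something.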
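As a cross-check on the arithmetic above, the sketch below redoes the calculation from the numbers in the log. It assumes float32 elements (4 bytes each, consistent with `T=DT_FLOAT` in the error) and that the allocator rounds requests up to a multiple of 256 bytes; the rounding is an inference from the chunk sizes in the log, not something the log states. On that reading, 775.72 MiB is about 813 million bytes rather than 775720000, and the free space exactly matches the request in total but is split into two regions, separated by the 56064-byte in-use chunk at 0x102722d4d00, so neither region is large enough on its own.

```python
# Numbers copied from the allocator log in the question.
LIMIT = 5484118016
IN_USE = 4670717952
FREE_CHUNKS = [659049984, 154350080]  # the two "Free at ..." entries


def round_up(n, multiple=256):
    """Round n up to the next multiple (assumed allocator granularity)."""
    return -(-n // multiple) * multiple


# Tensor from the error message: shape [14525, 14000], float32 = 4 bytes each.
raw_bytes = 14525 * 14000 * 4
request = round_up(raw_bytes)

print("requested bytes:", request)
print("requested MiB:  ", round(raw_bytes / 2**20, 2))
print("limit - in use: ", LIMIT - IN_USE)
print("sum of free:    ", sum(FREE_CHUNKS))
print("largest free:   ", max(FREE_CHUNKS))
print("fits in one free chunk:", max(FREE_CHUNKS) >= request)
```

If this reading is right, the failure is a fragmentation problem rather than a shortage of total memory: a single tensor needs one contiguous block.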



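import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # select the best-fit-with-coalescing (BFC) allocator
with tf.Session(config=config) as s:
    pass  # session body not included in the original post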


 gpu_options.allocator_type = 'BFC'




Thank you very much, Leon

(gputensorflow) leon@ljksUbuntu:~/Tensorflow$ python 
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.40GiB
I tensorflow/core/common_runtime/gpu/] DMA: 0 
I tensorflow/core/common_runtime/gpu/] 0:   Y 
I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (32768):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (65536):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (131072):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (8388608):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (134217728):     Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin (268435456):     Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/] Bin for 775.72MiB was 256.00MiB, Chunk State: 
I tensorflow/core/common_runtime/]   Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev:   Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next:   Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1
I tensorflow/core/common_runtime/] Chunk at 0x10208000000 of size 1280
I tensorflow/core/common_runtime/] Chunk at 0x10208000500 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x10208000600 of size 56064
I tensorflow/core/common_runtime/] Chunk at 0x1020800e100 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x1020800e200 of size 44288
I tensorflow/core/common_runtime/] Chunk at 0x10208018f00 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x10208019000 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x10208019100 of size 813400064
I tensorflow/core/common_runtime/] Chunk at 0x102387d1100 of size 56064
I tensorflow/core/common_runtime/] Chunk at 0x102387dec00 of size 154350080
I tensorflow/core/common_runtime/] Chunk at 0x10241b11e00 of size 44288
I tensorflow/core/common_runtime/] Chunk at 0x10241b1cb00 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x10241b1cc00 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x10241b1cd00 of size 154350080
I tensorflow/core/common_runtime/] Chunk at 0x102722d4d00 of size 56064
I tensorflow/core/common_runtime/] Chunk at 0x1027b615a00 of size 44288
I tensorflow/core/common_runtime/] Chunk at 0x1027b620700 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x1027b620800 of size 256
I tensorflow/core/common_runtime/] Chunk at 0x1027b620900 of size 813400064
I tensorflow/core/common_runtime/] Chunk at 0x102abdd8900 of size 813400064
I tensorflow/core/common_runtime/] Chunk at 0x102dc590900 of size 56064
I tensorflow/core/common_runtime/] Chunk at 0x102dc59e400 of size 56064
I tensorflow/core/common_runtime/] Chunk at 0x102dc5abf00 of size 154350080
I tensorflow/core/common_runtime/] Chunk at 0x102e58df100 of size 154350080
I tensorflow/core/common_runtime/] Chunk at 0x102eec12300 of size 44288
I tensorflow/core/common_runtime/] Chunk at 0x102eec1d000 of size 44288
I tensorflow/core/common_runtime/] Chunk at 0x102eec27d00 of size 1612612352
I tensorflow/core/common_runtime/] Free at 0x1024ae4ff00 of size 659049984
I tensorflow/core/common_runtime/] Free at 0x102722e2800 of size 154350080
I tensorflow/core/common_runtime/]      Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/] Stats: 
Limit:                  5484118016
InUse:                  4670717952
MaxInUse:               5484118016
NumAllocs:                      29
MaxAllocSize:           1612612352

W tensorflow/core/common_runtime/] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/] Ran out of memory trying to allocate 775.72MiB.  See logs for memory state.
W tensorflow/core/framework/] Resource exhausted: OOM when allocating tensor with shape[14525,14000]
Traceback (most recent call last):
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 1022, in _do_call
    return fn(*args)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 1004, in _run_fn
    status, run_metadata)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/", line 66, in __exit__
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/", line 469, in raise_exception_on_not_ok_status
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000]
     [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 171, in <module>
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 767, in run
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000]
     [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]

Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at:
  File "", line 94, in <module>
    initial_state=initial_state, time_major=False)       # time_major = FALSE currently
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 545, in dynamic_rnn
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 712, in _dynamic_rnn_loop
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 2626, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 2459, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 2409, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 697, in _time_step
    (output, new_state) = call_cell()
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 683, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/", line 179, in __call__
    concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/", line 747, in _linear
    "weights", [total_arg_size, output_size], dtype=dtype)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 988, in get_variable
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 890, in get_variable
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 348, in get_variable
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 333, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 684, in _get_single_variable
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 226, in __init__
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 303, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 673, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 360, in __call__
    dtype, seed=self.seed)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 246, in random_uniform
    return math_ops.add(rnd * (maxval - minval), minval, name=name)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/", line 73, in add
    result = _op_def_lib.apply_op("Add", x=x, y=y, name=name)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/", line 763, in apply_op
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/", line 1264, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000]
     [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]

8 answers:

Answer 0: (score: 3)



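Be careful not to run the evaluation and training binary on the same GPU or else you might run out of memory. Consider running the evaluation on a separate GPU if available or suspending the training binary while running the evaluation on the same GPU.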
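One way to follow this advice, sketched below, is to hide all but one GPU from the evaluation process with the `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow is imported. The helper name `pin_process_to_gpu` and the device index `1` are assumptions for illustration (a machine with a second GPU); the setup in the question has only one.

```python
import os


def pin_process_to_gpu(index):
    """Restrict this process to one GPU.

    Must run before the CUDA runtime initializes, e.g. before importing tensorflow.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# In the evaluation script: use the second GPU (index 1, hypothetical).
pin_process_to_gpu(1)
# import tensorflow as tf  # import only after the variable is set
```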

Answer 1: (score: 2)

I solved this problem by reducing batch_size (batch_size=52). The only way to reduce memory usage is to reduce batch_size.



Please see this: Another Stack Overflow Link
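Whether a smaller batch helps depends on which tensor is too large. As a rough sketch: per the traceback in the question, the tensor that failed is the LSTM weight matrix created by `_linear([inputs, h], 4 * self._num_units, ...)` with shape `[total_arg_size, output_size]`, so its size depends on the input width and the number of units, not on the batch size. The values `num_units = 3500` and `input_size = 11025` below are inferred from the reported shape `[14525, 14000]` and are not stated in the question; `num_steps` and the second batch size are placeholders.

```python
BYTES_PER_FLOAT32 = 4


def lstm_weight_bytes(input_size, num_units):
    """Bytes for a weight matrix of shape [input_size + num_units, 4 * num_units]."""
    return (input_size + num_units) * 4 * num_units * BYTES_PER_FLOAT32


def input_batch_bytes(batch_size, num_steps, input_size):
    """Bytes for one float32 input batch of shape [batch_size, num_steps, input_size]."""
    return batch_size * num_steps * input_size * BYTES_PER_FLOAT32


num_units = 14000 // 4          # 3500, inferred from the second dimension
input_size = 14525 - num_units  # 11025, inferred from the first dimension

# Weight matrix: about 775.7 MiB, independent of batch size.
print(lstm_weight_bytes(input_size, num_units) / 2**20)

# Input batch: scales linearly with batch size (num_steps=10 is hypothetical).
for batch_size in (52, 26):
    print(batch_size, input_batch_bytes(batch_size, 10, input_size) / 2**20)
```

Reducing the batch size frees memory used by activations and inputs, which can still be enough to let a large weight allocation succeed, but it does not shrink the weights themselves.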

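Answer 2: (score: 1)

I ran into the same problem. I closed all Anaconda Prompt windows and killed all Python tasks, then reopened an Anaconda Prompt window and ran the train.py file. The next run worked for me. The Anaconda and Python terminals had been holding the memory, leaving no room for the training process.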
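To see which processes are holding GPU memory before starting a training run, `nvidia-smi` can be queried from Python. This is a sketch that assumes an NVIDIA driver with `nvidia-smi` on the PATH and that it accepts the `--query-compute-apps` fields used here; the function names and the sample text fed to the parser are made up for illustration.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, MiB' lines into (pid, name, used_mib) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            used_mib = int(used.strip())
        except ValueError:
            used_mib = None  # some platforms report a non-numeric value here
        apps.append((int(pid.strip()), name.strip(), used_mib))
    return apps


def gpu_processes():
    """Run nvidia-smi and return the processes currently using GPU memory."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_compute_apps(out)


# Made-up sample output, for illustration only.
sample = "4242, python, 5230\n4310, python, 110\n"
for pid, name, used_mib in parse_compute_apps(sample):
    print(pid, name, used_mib, "MiB")
```

On a machine with a GPU, calling `gpu_processes()` before training would show whether a forgotten Python shell is still holding memory.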



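Answer 3: (score: 0)

When you hit an OOM on the GPU, I think changing the batch size is the right thing to try first.


Different GPUs may need different batch sizes, depending on how much GPU memory you have.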
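A simple way to find a batch size that fits a given GPU is to start high and halve on failure. The sketch below is framework-agnostic: `run_step` is a hypothetical callable standing in for one training step, and `MemoryError` stands in for the framework's out-of-memory exception (in the question's setup that would be `tf.errors.ResourceExhaustedError`).

```python
def find_batch_size(run_step, start, oom_error=MemoryError, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising oom_error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except oom_error:
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit")


# Stand-in for a training step: pretends anything above 60 samples is too big.
def fake_step(batch_size):
    if batch_size > 60:
        raise MemoryError("simulated OOM")


print(find_batch_size(fake_step, start=208))  # 208 and 104 fail, 52 fits
```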






Answer 4: (score: 0)


Answer 5: (score: 0)




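The previous model should be cleared. From the documentation: "Destroys the current TF graph and creates a new one. Useful to avoid clutter from old models / layers." After running and saving one model, clear the session, then run the next model.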
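The description above matches the documentation of Keras's `backend.clear_session()`. Assuming that is the function this answer refers to (the name did not survive in the post), a minimal sketch of training several models in one process could look like this. `train_models_in_sequence` and `train_fns` are hypothetical names; `train_fns` is a list of zero-argument callables that each build, train and save one model.

```python
def train_models_in_sequence(train_fns):
    """Call each training function, clearing Keras/TF global state after each one."""
    # Imported lazily so that defining this helper does not itself load TensorFlow.
    # Assumes a TensorFlow version that ships tf.keras.
    import tensorflow as tf

    results = []
    for train_fn in train_fns:
        results.append(train_fn())        # build, train and save one model
        tf.keras.backend.clear_session()  # drop the old graph before the next model
    return results
```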

Answer 6: (score: 0)

Use Ctrl + Shift + Esc to check system usage. You are running out of memory. End the tasks you don't need. It worked for me.

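Answer 7: (score: 0)

I found that the cause was really silly. I wanted to check the NN's architecture, so I had loaded tensorflow in a terminal. Even though tensorflow was only loaded and not in use, it was still holding resources. I closed the terminal and the OOM went away.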
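This is consistent with TensorFlow 1.x reserving most of the GPU's memory as soon as a session is created. A hedged sketch of one workaround, using the same `tf.ConfigProto` API as the question: `allow_growth` makes the allocator start small and grow on demand, and `per_process_gpu_memory_fraction` caps the share a process may take. The helper name and the fraction `0.3` are arbitrary illustrative choices.

```python
def make_light_session_config(memory_fraction=None):
    """Build a TF 1.x session config that does not grab all GPU memory up front."""
    import tensorflow as tf  # TensorFlow 1.x API, as in the question

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # allocate GPU memory on demand
    if memory_fraction is not None:
        # Upper bound on the fraction of GPU memory this process may use.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Usage in an interactive inspection session (TF 1.x):
#   sess = tf.Session(config=make_light_session_config(0.3))
```

Closing the side session, as the answer describes, remains the simplest fix; the config only limits how much such a session takes while it is open.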