Running the training script causes a memory error on a GPU-optimized Ubuntu machine. The error looks suspicious, because the machine appears to have plenty of memory available to run the algorithm.
This is the error:
TensorFlow: Ran out of memory trying to allocate 16.0KiB
Memory state:
$ free -m
             total        used        free      shared  buff/cache   available
Mem:         15038         190        6580           8        8267       14670
Swap:            0           0           0
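For reference, free -m only reports system RAM; the GPU's own memory, which is what TensorFlow actually allocates from, can be inspected separately with:
$ nvidia-smi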
Console output with the error:
$ python ./train.py --run --continue
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Loading data..
Number of categories: 2
Number of samples 425
/home/ubuntu/DeepClassificationBot-master/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
val = np.random.choice(dataset_indx, size=number_of_samples)
/home/ubuntu/DeepClassificationBot-master/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
train = np.random.choice(dataset_indx, size=number_of_samples)
Building and Compiling model..
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Training..
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 16.0KiB was 16.0KiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702580000 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702581b00 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583800 of size 147456
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025a7800 of size 147456
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025cb800 of size 256
....... Very long list of chunks
I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 115 Chunks of size 256 totalling 28.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 34 Chunks of size 512 totalling 17.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 21 Chunks of size 1024 totalling 21.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 2048 totalling 84.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 6912 totalling 47.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 16384 totalling 672.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 32768 totalling 160.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 147456 totalling 1008.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 294912 totalling 1.97MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 589824 totalling 3.94MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 1179648 totalling 7.88MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 14 Chunks of size 2359296 totalling 31.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 4718592 totalling 31.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 9437184 totalling 315.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 67108864 totalling 320.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 6 Chunks of size 411041792 totalling 2.30GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 663988224 totalling 633.23MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.61GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 3878682624
InUse: 3878682624
MaxInUse: 3878682624
NumAllocs: 362
MaxAllocSize: 663988224
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************************************************************************************xxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 16.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:930] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "./train.py", line 154, in <module>
run(extract=extract_mode, cont=continue_)
File "./train.py", line 104, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
self._make_train_function()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates
vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params]
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 78, in variable
get_session().run(v.initializer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized.
[[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'Variable_91/initial_value', defined at:
File "./train.py", line 154, in <module>
run(extract=extract_mode, cont=continue_)
File "./train.py", line 104, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
self._make_train_function()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates
vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params]
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 75, in variable
v = tf.Variable(np.asarray(value, dtype=dtype), name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 211, in __init__
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 289, in _init_from_args
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 628, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 167, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2317, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1239, in __init__
self._traceback = _extract_stack()
Running the deploy script, on the other hand, works, but it still emits memory warnings:
$ python deploy.py --URL http://www.example.com/image.jpg
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
1/1 [==============================] - 2s
______________________________________________
Image Name: image.jpg
Categories:
1. shrek 100.00%
2. darth vader 0.00%
Answer (score 0):
I posted this answer on another question, but I figured I'd post it here for visibility. I'm not sure how large your model is, but in my case I was running a relatively small model and was somewhat surprised that such a small model produced an OOM error.
Specifically, I was getting out-of-memory errors while training a small CNN on a GTX 970. By some fluke, I discovered that telling TensorFlow to allocate memory on the GPU as needed (rather than up front) resolved all my issues. This can be done with the following Python code:
import tensorflow as tf

# Allocate GPU memory on demand rather than reserving most of it up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Previously, TensorFlow would pre-allocate ~90% of GPU memory. For some unknown reason, this would later lead to out-of-memory errors when I increased the size of the network. With the code above, I no longer get OOM errors.
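Since train.py in the question drives TensorFlow through Keras rather than creating a session itself, the configured session also has to be handed to Keras explicitly. A minimal sketch, assuming the Keras TensorFlow backend shown in the traceback above (which provides a set_session helper); run this before building the model:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Configure on-demand GPU memory allocation, then register the session
# with Keras so all subsequent model code runs in it
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

If growing allocation is not desired, an alternative is to cap the up-front reservation instead, e.g. config.gpu_options.per_process_gpu_memory_fraction = 0.5 to claim at most half of the GPU's memory; both options modify the pre-allocation behaviour described above.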