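Running a Keras training algorithm inside a Docker machine on a Mac causes various memory problems.
The training algorithm runs fine on the same machine outside Docker.
Raising the Docker memory from 1 GB to 8 GB (the machine limit) does not help.
Max video memory: 128 MB.
Different TensorFlow versions (0.10.0 and 0.11.0) and the Theano backend, all run from Docker, show similar errors.
The list of other Docker processes that could conflict (docker ps -a) is empty.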
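One way to confirm that the 8 GB setting actually reaches the container is to print the memory the kernel reports from inside it. The following is a minimal sketch, not part of the original setup: the helper name mem_total_gib is made up for illustration, and it assumes a Linux container where /proc/meminfo exists. Note that /proc/meminfo reports the memory of the Linux VM hosting the container and does not reflect a per-container limit set with docker run --memory.

```python
import os

MEMINFO_PATH = "/proc/meminfo"  # Linux-only; present inside Linux containers


def mem_total_gib(meminfo_text):
    """Parse the MemTotal line of /proc/meminfo text and return it in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / float(2 ** 20)
    raise ValueError("MemTotal not found")


if os.path.exists(MEMINFO_PATH):
    with open(MEMINFO_PATH) as f:
        print("Memory visible here: %.2f GiB" % mem_total_gib(f.read()))
else:
    print("No /proc/meminfo on this system")
```

If the figure printed inside the container is well below 8 GiB, the memory setting is probably not being applied to the VM that runs the containers.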
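The question is why the same training algorithm performs so much worse on the same machine when it runs under Docker. All errors point to memory management problems: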
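1) When the training script ran as part of the container's docker build process, the original error was a MemoryError, and the process exited before training started.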
2) Now, running docker run 058785edc11d python train.py --run after building the container, I get ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254] (it seems to get one step further):
Training..
Train on 385 samples, validate on 40 samples
Epoch 1/1
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
callback_metrics=callback_metrics)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
outs = f(ins_batch)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254]
[[Node: transpose_2 = Transpose[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Conv2D, transpose_2/perm)]]
Caused by op u'transpose_2', defined at:
File "train.py", line 138, in <module>
run(extract=extract_mode, cont=continue_)
File "train.py", line 79, in run
model = m.get_model(n_outputs=num_categories, input_size=size)
File "/tmp/model.py", line 24, in get_model
conv.add(Convolution2D(64, 3, 3, activation='relu', input_shape=(3, input_size, input_size)))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 110, in add
layer.create_input_layer(batch_input_shape, input_dtype)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 341, in create_input_layer
self(x)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 341, in call
filter_shape=self.W_shape)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 997, in conv2d
x = tf.transpose(x, (0, 3, 1, 2))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1051, in transpose
ret = gen_array_ops.transpose(a, perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2489, in transpose
result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
3) After removing the exited Docker containers and reducing the training batch size, I get std::bad_alloc:
Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
4) Another common error is Resource exhausted: OOM when allocating tensor with shape[25088,4096]:
$ docker run f825faab715c python train.py --run --continue
libdc1394 error: Failed to initialize libdc1394
Using TensorFlow backend.
/tmp/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
val = np.random.choice(dataset_indx, size=number_of_samples)
/tmp/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
train = np.random.choice(dataset_indx, size=number_of_samples)
Loading data..
Number of categories: 2
Number of samples 425
Building and Compiling model..
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[4096,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
[[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[25088,4096]
[[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
Traceback (most recent call last):
File "train.py", line 138, in <module>
run(extract=extract_mode, cont=continue_)
File "train.py", line 100, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
callback_metrics=callback_metrics)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
outs = f(ins_batch)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[25088,4096]
[[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Caused by op u'gradients/MatMul_grad/MatMul_1', defined at:
File "train.py", line 138, in <module>
run(extract=extract_mode, cont=continue_)
File "train.py", line 100, in run
sample_weight=None)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
self._make_train_function()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 307, in get_updates
grads = self.get_gradients(loss, params)
File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 48, in get_gradients
grads = K.gradients(loss, params)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 666, in gradients
return tf.gradients(loss, variables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
in_grads = _AsList(grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 637, in _MatMulGrad
math_ops.matmul(op.inputs[0], grad, transpose_a=True))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
...which was originally created as op u'MatMul', defined at:
File "train.py", line 138, in <module>
run(extract=extract_mode, cont=continue_)
File "train.py", line 79, in run
model = m.get_model(n_outputs=num_categories, input_size=size)
File "/tmp/model.py", line 70, in get_model
conv.add(Dense(4096))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 142, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 628, in call
output = K.dot(x, self.W)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 214, in dot
out = tf.matmul(x, y)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
Answer 0 (score: 1)
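Your training algorithm may simply need more than 8 GB of memory. I have run into problems like this before, and increasing the memory always solved them. Your error, ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254], clearly indicates that resources are exhausted and that the application needs more memory to run.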
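To put rough numbers on this, the size of each tensor named in the error messages can be estimated from its shape. This is a back-of-the-envelope sketch: it assumes 4 bytes per element, since the tracebacks report T=DT_FLOAT, and the helper names are made up for illustration.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: DT_FLOAT in the traceback is a 32-bit float


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in bytes of one dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / float(2 ** 30)


# Shapes copied from the OOM messages in the question.
shapes = [(64, 64, 254, 254), (25088, 4096), (4096, 4096)]
for shape in shapes:
    size = tensor_bytes(shape)
    print("%s: %d bytes (%.2f GiB)" % (shape, size, to_gib(size)))
```

The [64,64,254,254] tensor alone comes to 1,057,030,144 bytes, close to 1 GiB, and the [25088,4096] matrix to 411,041,792 bytes. Training typically keeps several tensors of these sizes alive at once (activations, gradients, optimizer state), so the total can be a multiple of these figures. If the first dimension of [64,64,254,254] is the batch size, which is an inference from the shape and not something the question states, halving the batch size would halve that tensor.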