Memory issues when running Keras in Docker on a Mac

Asked: 2016-11-26 13:02:27

Tags: python python-2.7 docker keras docker-machine

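Running a Keras training algorithm inside a Docker machine on a Mac causes various memory problems.

  • The training algorithm runs fine on the same machine outside Docker

  • Raising the Docker memory from 1 GB to 8 GB (the machine's limit) does not help

  • Max video memory: 128 MB

  • Both the TensorFlow (0.10.0, 0.11.0) and Theano backends show similar errors inside Docker

  • No other Docker processes appear to conflict: docker ps -a returns an empty list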
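
One thing worth ruling out is whether the 8 GB setting actually reached the Linux VM that hosts the containers. The sketch below is not from the original post. It assumes a Linux container where /proc/meminfo is readable, and the helper names parse_meminfo and report are made up for illustration. Note that /proc/meminfo reports the VM's memory, not a per-container cgroup limit.

```python
import os


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to their numeric value.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def report(path="/proc/meminfo"):
    """Print the total, available and swap memory visible to this process."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    for key in ("MemTotal", "MemAvailable", "SwapTotal"):
        if key in info:
            # Values are in kB, so divide by 1024 * 1024 to get GiB.
            print("%s: %.2f GiB" % (key, info[key] / (1024.0 * 1024.0)))


if __name__ == "__main__":
    # Only meaningful on Linux, e.g. when run inside the container.
    if os.path.exists("/proc/meminfo"):
        report()
```

Running this inside the container (for example as the command passed to docker run) and seeing a MemTotal close to 1 GiB would suggest the memory setting never took effect.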

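The problem is that the same training algorithm performs much worse when run in the Docker machine. All errors point to a memory management problem.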

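1) When the training script was run as part of the container's docker build step, the original error was MemoryError, and the process exited before training started.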

2) Now, running docker run 058785edc11d python train.py --run after the container is built, I get ResourceExhaustedError: OOM when allocating a tensor with shape [64,64,254,254] (it seems to get one step further):

Training..
Train on 385 samples, validate on 40 samples
Epoch 1/1
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
    callback_metrics=callback_metrics)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
    updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254]
     [[Node: transpose_2 = Transpose[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Conv2D, transpose_2/perm)]]
Caused by op u'transpose_2', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 79, in run
    model = m.get_model(n_outputs=num_categories, input_size=size)
  File "/tmp/model.py", line 24, in get_model
    conv.add(Convolution2D(64, 3, 3, activation='relu', input_shape=(3, input_size, input_size)))
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 110, in add
    layer.create_input_layer(batch_input_shape, input_dtype)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 341, in create_input_layer
    self(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 341, in call
    filter_shape=self.W_shape)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 997, in conv2d
    x = tf.transpose(x, (0, 3, 1, 2))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1051, in transpose
    ret = gen_array_ops.transpose(a, perm, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2489, in transpose
    result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
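
For scale, the failing tensor can be sized by hand. The sketch below is an illustration rather than part of the original post. It assumes the first dimension of shape[64,64,254,254] is the batch size (the second would then be the 64 filters of the first Convolution2D layer) and relies on DT_FLOAT being 4 bytes per element. The helper name tensor_bytes is hypothetical.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor; 4 bytes per element for DT_FLOAT."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


GIB = 1024.0 ** 3

# Assumption: the leading dimension is the batch size, so this one
# activation shrinks linearly as the batch size is reduced.
for batch in (64, 32, 16, 8):
    shape = (batch, 64, 254, 254)
    print("batch %2d: %.2f GiB" % (batch, tensor_bytes(shape) / GIB))
```

Under these assumptions the single transposed activation at batch size 64 already takes roughly 1 GiB, before counting the untransposed copy, later layers, or gradients.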

3) After removing the exited Docker containers and reducing the training batch size, I get std::bad_alloc:

Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

4) Another common error is Resource exhausted: OOM when allocating tensor with shape [25088,4096]:

$ docker run f825faab715c python train.py --run --continue
libdc1394 error: Failed to initialize libdc1394
Using TensorFlow backend.
/tmp/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  val = np.random.choice(dataset_indx, size=number_of_samples)
/tmp/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  train = np.random.choice(dataset_indx, size=number_of_samples)
Loading data..
Number of categories: 2
Number of samples 425
Building and Compiling model..
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[4096,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 100, in run
    sample_weight=None)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
    callback_metrics=callback_metrics)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
    updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Caused by op u'gradients/MatMul_grad/MatMul_1', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 100, in run
    sample_weight=None)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
    self._make_train_function()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
    training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
  File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 307, in get_updates
    grads = self.get_gradients(loss, params)
  File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 48, in get_gradients
    grads = K.gradients(loss, params)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 666, in gradients
    return tf.gradients(loss, variables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
    in_grads = _AsList(grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 637, in _MatMulGrad
    math_ops.matmul(op.inputs[0], grad, transpose_a=True))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'MatMul', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 79, in run
    model = m.get_model(n_outputs=num_categories, input_size=size)
  File "/tmp/model.py", line 70, in get_model
    conv.add(Dense(4096))
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 142, in add
    output_tensor = layer(self.outputs[0])
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 628, in call
    output = K.dot(x, self.W)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 214, in dot
    out = tf.matmul(x, y)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

1 Answer:

Answer 0 (score: 1)

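Your training algorithm probably needs more than 8 GB of memory. I have run into problems like this before, and increasing the memory always solved them. Your error, ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254], clearly indicates that your resources are exhausted and that the application needs more memory to run.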
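
To put rough numbers on this, the fully connected layers named in the question's warnings can be sized the same way. This is a back-of-the-envelope sketch, not a measurement. It assumes float32 weights and that training keeps at least one gradient of the same shape per weight matrix. The optimizer's own state is unknown from the logs and is left out, so the result is only a lower bound. The helper name tensor_mib is hypothetical.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # DT_FLOAT in the logs is a 32-bit float


def tensor_mib(shape):
    """Size in MiB of a float32 tensor with the given shape."""
    return reduce(mul, shape, 1) * BYTES_PER_FLOAT32 / (1024.0 ** 2)


# Shapes copied from the "Resource exhausted" warnings in the question.
dense_weights = [(25088, 4096), (4096, 4096)]

total = 0.0
for shape in dense_weights:
    size = tensor_mib(shape)
    # Assumption: weights plus one gradient of the same shape are held
    # at once; any optimizer state would add more on top of this.
    total += 2 * size
    print("%s: %.0f MiB weights, %.0f MiB gradient" % (shape, size, size))

print("lower bound for these two layers: %.0f MiB" % total)
```

The larger matrix alone is 392 MiB, so with its gradient these two layers need several hundred MiB more than a 1 GB VM can comfortably spare. That fits the answer's diagnosis if the VM is still at its original size, but it does not by itself explain failures with 8 GB available.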