I have the following function (it is part of a custom Keras layer, but that is not really important here):
from keras import backend as K
import theano as T  # T.scan below assumes theano itself was imported as T

def call(self, input, mask=None):
    # Project the input tokens into the embedding space
    x = K.dot(input[0], self.W_emb) + self.b_emb
    bucket_size = input[1][0][0]
    # Initial loop state: stack, read cursors, and a mask over stack cells
    stack = K.zeros((self.batch_size, self.max_len, 2 * self.hidden_dim))
    cursors = K.concatenate([K.ones((self.batch_size, 1)),
                             K.zeros((self.batch_size, self.max_len - 1))],
                            axis=1)
    stack_mask = K.zeros((self.batch_size, self.max_len))
    results, _ = T.scan(self.encoder_step,
                        outputs_info=[stack, cursors, stack_mask],
                        non_sequences=[x, mask[0]],
                        n_steps=2 * bucket_size)
    # Only the stack from the very last step is needed
    last_value = results[0][-1]
    return last_value[:, 0, self.hidden_dim:]
self.encoder_step performs some recurrent computation.
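For context, scan calls the step function with the previous value of each outputs_info entry first, followed by the non_sequences, and expects one updated value per state in return. So its signature has to look roughly like this (parameter names are mine, and the body is a placeholder, not the real implementation):

def encoder_step(self, stack, cursors, stack_mask, x, input_mask):
    # A real implementation does a shift/reduce-style update here;
    # returning the states unchanged just keeps this sketch runnable.
    new_stack = stack
    new_cursors = cursors
    new_stack_mask = stack_mask
    return new_stack, new_cursors, new_stack_mask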
If I run this function with moderately sized parameters (bucket_size = 128, self.batch_size = 64, self.max_len = 128, self.hidden_dim = 256), I get CNMEM_STATUS_OUT_OF_MEMORY. The error log with exception_verbosity=high shows that Theano has allocated a tensor 'forall_inplace,gpu,scan_fn}.0' of shape (X, 64, 128, 512), where X is some number up to 2*bucket_size. It seems that Theano stores the scan output values from every step, even though I never use them.
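To show the scan structure in isolation, here is a stripped-down, self-contained sketch with invented shapes and a dummy step body; just like in my layer, only the final state is consumed:

import theano
import theano.tensor as TT

state0 = TT.zeros((64, 512))

def step(prev):
    # Dummy recurrence standing in for encoder_step
    return prev + 1.0

results, updates = theano.scan(step, outputs_info=[state0], n_steps=256)
last = results[-1]                      # only the final state is wanted
f = theano.function([], last, updates=updates)
print(f().shape)                        # (64, 512)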
Example log:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/NLP_RL/train_char.py", line 229, in run_training_RL
loss1 = encoder.train_on_batch(batch[0], batch[1])
File "/usr/local/lib/python3.4/dist-packages/keras/engine/training.py", line 1239, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python3.4/dist-packages/keras/backend/theano_backend.py", line 792, in __call__
return self.function(*inputs)
File "/usr/local/lib/python3.4/dist-packages/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python3.4/dist-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python3.4/dist-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.4/dist-packages/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
MemoryError: Error allocating 3758096384 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).
Apply node that caused the error: GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Elemwise{sub,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0)
Toposort index: 121
Inputs types: [CudaNdarrayType(float32, (True, True, True, True)), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(1, 1, 1, 1), (), (), (), ()]
Inputs strides: [(0, 0, 0, 0), (), (), (), ()]
Inputs values: [b'CudaNdarray([[[[ 0.]]]])', array(224), array(64), array(128), array(512)]
Outputs clients: [[GpuIncSubtensor{InplaceInc;int64}(GpuAlloc{memset_0=True}.0, GpuIncSubtensor{InplaceInc;::, int64, int64::}.0, Constant{-1})]]
...
Storage map footprint:
- forall_inplace,gpu,scan_fn}.0, Shape: (225, 64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 3774873600 Byte(s)
- GpuAlloc{memset_0=True}.0, Shape: (225, 64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 3774873600 Byte(s)
- GpuElemwise{Add}[(0, 0)].0, Shape: (64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 16777216 Byte(s)
- <CudaNdarrayType(float32, 3D)>, Shared Input, Shape: (64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 16777216 Byte(s)
- forall_inplace,gpu,scan_fn}.1, Shape: (225, 64, 128), ElemSize: 4 Byte(s), TotalSize: 7372800 Byte(s)
- forall_inplace,gpu,scan_fn}.2, Shape: (225, 64, 128), ElemSize: 4 Byte(s), TotalSize: 7372800 Byte(s)
- input_8, Input, Shape: (64, 128, 83), ElemSize: 4 Byte(s), TotalSize: 2719744 Byte(s)
- GpuReshape{2}.0, Shape: (8192, 83), ElemSize: 4 Byte(s), TotalSize: 2719744 Byte(s)
...
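The sizes line up exactly with scan materializing every step: each of the two big storage-map entries holds n_steps x batch_size x max_len x 2*hidden_dim float32 values, and the GpuAlloc that failed is the same layout with 224 steps. A quick check:

batch, max_len, dim = 64, 128, 512        # dim = 2 * self.hidden_dim
print(225 * batch * max_len * dim * 4)    # 3774873600, the storage-map entries
print(224 * batch * max_len * dim * 4)    # 3758096384, the allocation that failed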
How do I use scan correctly so that the intermediate results are not stored?