I have the following function (it is part of a custom Keras layer, but that is not really important here):
from keras import backend as K
import theano as T  # T.scan below assumes theano itself was imported as T

def call(self, input, mask=None):
    # Project the input tokens into the embedding space
    x = K.dot(input[0], self.W_emb) + self.b_emb
    bucket_size = input[1][0][0]
    # Initial loop state: stack, read cursors, and a mask over stack cells
    stack = K.zeros((self.batch_size, self.max_len, 2 * self.hidden_dim))
    cursors = K.concatenate([K.ones((self.batch_size, 1)),
                             K.zeros((self.batch_size, self.max_len - 1))],
                            axis=1)
    stack_mask = K.zeros((self.batch_size, self.max_len))
    results, _ = T.scan(self.encoder_step,
                        outputs_info=[stack, cursors, stack_mask],
                        non_sequences=[x, mask[0]],
                        n_steps=2 * bucket_size)
    # Only the stack from the very last step is needed
    last_value = results[0][-1]
    return last_value[:, 0, self.hidden_dim:]
self.encoder_step performs some recurrent computation.
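For context, scan calls the step function with the previous value of each outputs_info entry first, followed by the non_sequences, and expects one updated value per state in return. So its signature has to look roughly like this (parameter names are mine, and the body is a placeholder, not the real implementation):

def encoder_step(self, stack, cursors, stack_mask, x, input_mask):
    # A real implementation does a shift/reduce-style update here;
    # returning the states unchanged just keeps this sketch runnable.
    new_stack = stack
    new_cursors = cursors
    new_stack_mask = stack_mask
    return new_stack, new_cursors, new_stack_mask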
If I run this function with moderately sized parameters (bucket_size = 128, self.batch_size = 64, self.max_len = 128, self.hidden_dim = 256), I get CNMEM_STATUS_OUT_OF_MEMORY. The error log with exception_verbosity=high shows that Theano has allocated a tensor 'forall_inplace,gpu,scan_fn}.0' of shape (X, 64, 128, 512), where X is some number up to 2*bucket_size. It seems that Theano stores the scan output values from every step, even though I never use them.
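To show the scan structure in isolation, here is a stripped-down, self-contained sketch with invented shapes and a dummy step body; just like in my layer, only the final state is consumed:

import theano
import theano.tensor as TT

state0 = TT.zeros((64, 512))

def step(prev):
    # Dummy recurrence standing in for encoder_step
    return prev + 1.0

results, updates = theano.scan(step, outputs_info=[state0], n_steps=256)
last = results[-1]                      # only the final state is wanted
f = theano.function([], last, updates=updates)
print(f().shape)                        # (64, 512)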
Example log:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/NLP_RL/train_char.py", line 229, in run_training_RL
loss1 = encoder.train_on_batch(batch[0], batch[1])
File "/usr/local/lib/python3.4/dist-packages/keras/engine/training.py", line 1239, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python3.4/dist-packages/keras/backend/theano_backend.py", line 792, in __call__
return self.function(*inputs)
File "/usr/local/lib/python3.4/dist-packages/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python3.4/dist-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python3.4/dist-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.4/dist-packages/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
MemoryError: Error allocating 3758096384 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).
Apply node that caused the error: GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Elemwise{sub,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0)
Toposort index: 121
Inputs types: [CudaNdarrayType(float32, (True, True, True, True)), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(1, 1, 1, 1), (), (), (), ()]
Inputs strides: [(0, 0, 0, 0), (), (), (), ()]
Inputs values: [b'CudaNdarray([[[[ 0.]]]])', array(224), array(64), array(128), array(512)]
Outputs clients: [[GpuIncSubtensor{InplaceInc;int64}(GpuAlloc{memset_0=True}.0, GpuIncSubtensor{InplaceInc;::, int64, int64::}.0, Constant{-1})]]
...
Storage map footprint:
- forall_inplace,gpu,scan_fn}.0, Shape: (225, 64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 3774873600 Byte(s)
- GpuAlloc{memset_0=True}.0, Shape: (225, 64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 3774873600 Byte(s)
- GpuElemwise{Add}[(0, 0)].0, Shape: (64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 16777216 Byte(s)
- <CudaNdarrayType(float32, 3D)>, Shared Input, Shape: (64, 128, 512), ElemSize: 4 Byte(s), TotalSize: 16777216 Byte(s)
- forall_inplace,gpu,scan_fn}.1, Shape: (225, 64, 128), ElemSize: 4 Byte(s), TotalSize: 7372800 Byte(s)
- forall_inplace,gpu,scan_fn}.2, Shape: (225, 64, 128), ElemSize: 4 Byte(s), TotalSize: 7372800 Byte(s)
- input_8, Input, Shape: (64, 128, 83), ElemSize: 4 Byte(s), TotalSize: 2719744 Byte(s)
- GpuReshape{2}.0, Shape: (8192, 83), ElemSize: 4 Byte(s), TotalSize: 2719744 Byte(s)
...
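The sizes line up exactly with scan materializing every step: each of the two big storage-map entries holds n_steps x batch_size x max_len x 2*hidden_dim float32 values, and the GpuAlloc that failed is the same layout with 224 steps. A quick check:

batch, max_len, dim = 64, 128, 512        # dim = 2 * self.hidden_dim
print(225 * batch * max_len * dim * 4)    # 3774873600, the storage-map entries
print(224 * batch * max_len * dim * 4)    # 3758096384, the allocation that failed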
How do I use scan correctly so that the intermediate results are not stored?