This question is probably similar to Keras uses way too much GPU memory when calling train_on_batch, fit, etc, but I could not find an answer/solution there, so I am posting it here. It has also been filed as an Issue in Keras.
I have a model in which, at some point, I want to merge two data features using a locally connected layer, with 4 extra dimensions for a richer representation. Since the two feature vectors come from the same network, merging them locally should be much more efficient than using a fully connected layer. Here is a simple toy example of this architecture:
from keras.layers import Input, Flatten, LocallyConnected2D, Conv2D, Dense
from keras.models import Model
import keras.backend as K
import numpy as np


def local_conv_sample(params):
    feat_dim = params['input_dim']
    M = params['agg_dim']

    input_s = Input(shape=(2, feat_dim, 1))
    output = Conv2D(M, (2, 1), name='AggConv1')(input_s)
    output = LocallyConnected2D(1, (1, 1), name='AggConv2')(output)
    output = Flatten()(output)
    output = Dense(1)(output)

    model = Model(inputs=input_s, outputs=output)
    model.compile(optimizer='sgd', loss='mse')
    return model


K.clear_session()
params = {'input_dim': 5000,
          'agg_dim': 4,
          'batch_size': 40}

model = local_conv_sample(params)
print model.summary()
Here is the model summary, along with my nvidia-smi information after compiling. Everything seems fine up to this point:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 2, 5000, 1) 0
_________________________________________________________________
AggConv1 (Conv2D) (None, 1, 5000, 4) 12
_________________________________________________________________
AggConv2 (LocallyConnected2D (None, 1, 5000, 1) 25000
_________________________________________________________________
flatten_1 (Flatten) (None, 5000) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 5001
=================================================================
Total params: 30,013
Trainable params: 30,013
Non-trainable params: 0
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 20882 C /usr/bin/python 69MiB |
+-----------------------------------------------------------------------------+
The problem: such a simple model exhausts the allocatable memory on my 11439MiB GPU. It compiles the model just fine (although it takes a while), but when calling fit the GPU usage goes to 98MiB, then 174MiB, and then OOM:
model.fit(np.random.rand(params['batch_size'], 2, params['input_dim'], 1),
          np.random.rand(params['batch_size']))
which returns:
I tensorflow/core/common_runtime/bfc_allocator.cc:676] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 26 Chunks of size 256 totalling 6.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 2 Chunks of size 1280 totalling 2.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 6 Chunks of size 20224 totalling 118.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 4 Chunks of size 80128 totalling 313.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2097152 totalling 2.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 4033 Chunks of size 2560000 totalling 9.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2795008 totalling 2.67MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2834432 totalling 2.70MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3108864 totalling 2.96MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3268608 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3657728 totalling 3.49MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3661824 totalling 3.49MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3977216 totalling 3.79MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4194304 totalling 4.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4390912 totalling 4.19MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4407296 totalling 4.20MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4755456 totalling 4.54MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4763648 totalling 4.54MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.66GiB
****************************************************************************************************
W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[32,1,5000,4]
Surprisingly, this does not happen when a Dense layer is used instead of the LocallyConnected2D, even though it has roughly 4000 times as many parameters:
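Presumably that variant was built along these lines (my reconstruction from the summary below, not the exact original code; local_dense_sample is just a name I picked):

def local_dense_sample(params):
    feat_dim = params['input_dim']
    M = params['agg_dim']

    input_s = Input(shape=(2, feat_dim, 1))
    output = Conv2D(M, (2, 1), name='AggConv1')(input_s)
    output = Flatten()(output)                          # (None, feat_dim * M) = (None, 20000)
    output = Dense(feat_dim, name='AggDense')(output)   # 20000 * 5000 + 5000 = 100,005,000 params
    output = Dense(1)(output)

    model = Model(inputs=input_s, outputs=output)
    model.compile(optimizer='sgd', loss='mse')
    return model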
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 2, 5000, 1) 0
_________________________________________________________________
AggConv1 (Conv2D) (None, 1, 5000, 4) 12
_________________________________________________________________
flatten_1 (Flatten) (None, 20000) 0
_________________________________________________________________
AggDense (Dense) (None, 5000) 100005000
_________________________________________________________________
dense_1 (Dense) (None, 1) 5001
=================================================================
Total params: 100,010,013
Trainable params: 100,010,013
Non-trainable params: 0
_________________________________________________________________
The above architecture trains just fine. Why does this happen? The allocation of 4033 Chunks of size 2560000 totalling 9.62GiB seems to be the problem. Apparently it is allocating the input to the LocallyConnected2D layer (of shape [None, 1, 5000, 4]) feat_dim times. Any clue as to what could fix this?
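For what it's worth, the chunk size is consistent with that reading: a float32 copy of the LocallyConnected2D input for the batch slice of 32 reported in the OOM message is exactly 2,560,000 bytes, and keeping one per output position would overflow the 11GiB card. A back-of-the-envelope check (assuming float32 and the shape [32, 1, 5000, 4] from the OOM message; this is arithmetic, not a reading of the TF kernels):

# assumptions: float32 tensors, batch slice of 32 from the OOM shape [32, 1, 5000, 4]
bytes_per_float = 4
batch, positions, channels = 32, 5000, 4          # feat_dim = 5000, agg_dim = 4

chunk = bytes_per_float * batch * 1 * positions * channels
print chunk                                       # 2560000 -> matches the chunk size in the dump
print 4033 * chunk / float(2 ** 30)               # ~9.6 GiB -> matches the in-use total
print positions * chunk / float(2 ** 30)          # ~11.9 GiB if one chunk were kept per position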
My version setup: running on Python 2.7, with Keras 2.1.2 and TensorFlow 1.4.0 as the backend.
You can play with the toy model by increasing/decreasing the batch size. Decreasing the batch size does not work for me, since my feature dimension is large (~8k) and I cannot use batches of only 20 samples... I am surprised that a model with so few parameters takes up this much memory.
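One workaround I am considering, since a LocallyConnected2D with a (1, 1) kernel and a single filter is just an independent linear combination over the channel axis at every position: express it as a broadcasted multiply plus a sum over channels, which should sidestep whatever per-position unfolding is causing the allocations above. A rough sketch only (PositionwiseChannelMix is my own layer, and I have not verified that it reproduces LocallyConnected2D exactly):

import keras.backend as K
from keras.engine.topology import Layer

class PositionwiseChannelMix(Layer):
    # Per-position linear mix over the channel axis, meant as a lighter
    # stand-in for LocallyConnected2D(1, (1, 1)) on (batch, 1, positions, channels)
    # inputs: one broadcasted multiply + sum instead of per-position kernels.

    def build(self, input_shape):
        _, h, w, c = input_shape                    # e.g. (None, 1, 5000, 4)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(1, h, w, c),
                                      initializer='glorot_uniform',
                                      trainable=True)
        self.bias = self.add_weight(name='bias',
                                    shape=(1, h, w, 1),
                                    initializer='zeros',
                                    trainable=True)
        super(PositionwiseChannelMix, self).build(input_shape)

    def call(self, x):
        # weight every (position, channel) entry, then collapse the channels
        return K.sum(x * self.kernel, axis=-1, keepdims=True) + self.bias

    def compute_output_shape(self, input_shape):
        return input_shape[:3] + (1,)

# inside local_conv_sample, the AggConv2 line would become:
#     output = PositionwiseChannelMix(name='AggConv2')(output)

The parameter count would be the same 25,000 as the original AggConv2 layer (5000 x 4 weights plus 5000 biases), so the comparison stays apples to apples.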