How to make a Keras LocallyConnected layer memory-efficient?

Time: 2018-07-20 08:41:36

Tags: python tensorflow keras

This question is probably similar to Keras uses way too much GPU memory when calling train_on_batch, fit, etc, but I could not find an answer or workaround there, so I am posting it here. It has also been posted as an Issue in Keras.

I have a model in which, at some point, I want to merge two data features using a locally connected layer, with 4 extra dimensions for a richer representation. Since the two feature vectors come from the same network, merging them locally should be much more efficient than using a fully connected layer. Here is a simple toy example of such an architecture:

from keras.layers import Input, Flatten, LocallyConnected2D, Conv2D, Dense
from keras.models import Model
import keras.backend as K
import numpy as np

def local_conv_sample(params):
    feat_dim = params['input_dim']
    M = params['agg_dim']

    input_s = Input(shape=(2, feat_dim, 1))

    output = Conv2D(M, (2,1), name='AggConv1')(input_s)
    output = LocallyConnected2D(1, (1,1), name='AggConv2')(output)
    output = Flatten()(output)

    output = Dense(1)(output)
    model = Model(inputs=input_s, outputs=output)

    model.compile(optimizer='sgd', loss='mse')
    return model

K.clear_session()
params = {'input_dim': 5000,
          'agg_dim': 4,
          'batch_size': 40}

model = local_conv_sample(params)
model.summary()

Here is the model summary, together with my nvidia-smi output after compilation. Everything seems fine up to this point:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 2, 5000, 1)        0         
_________________________________________________________________
AggConv1 (Conv2D)            (None, 1, 5000, 4)        12        
_________________________________________________________________
AggConv2 (LocallyConnected2D (None, 1, 5000, 1)        25000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 5000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5001      
=================================================================
Total params: 30,013
Trainable params: 30,013
Non-trainable params: 0


+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     20882      C   /usr/bin/python                               69MiB |
+-----------------------------------------------------------------------------+

The problem: on my 11439MiB GPU, this simple model runs out of memory allocation space. The model compiles fine (although it takes a while), but when fit is called, GPU usage goes to 98MiB, then 174MiB, and then OOM:

model.fit(np.random.rand(params['batch_size'], 2, params['input_dim'], 1),
          np.random.rand(params['batch_size']))

returns

I tensorflow/core/common_runtime/bfc_allocator.cc:676]      Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 26 Chunks of size 256 totalling 6.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 2 Chunks of size 1280 totalling 2.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 6 Chunks of size 20224 totalling 118.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 4 Chunks of size 80128 totalling 313.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2097152 totalling 2.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 4033 Chunks of size 2560000 totalling 9.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2795008 totalling 2.67MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2834432 totalling 2.70MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3108864 totalling 2.96MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3268608 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3657728 totalling 3.49MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3661824 totalling 3.49MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3977216 totalling 3.79MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4194304 totalling 4.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4390912 totalling 4.19MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4407296 totalling 4.20MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4755456 totalling 4.54MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 4763648 totalling 4.54MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.66GiB
****************************************************************************************************
W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[32,1,5000,4]

Surprisingly, this does not happen when a Dense layer is used instead of the LocallyConnected2D, even though that layer has roughly 4000 times as many parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 2, 5000, 1)        0         
_________________________________________________________________
AggConv1 (Conv2D)            (None, 1, 5000, 4)        12        
_________________________________________________________________
flatten_1 (Flatten)          (None, 20000)             0         
_________________________________________________________________
AggDense (Dense)             (None, 5000)              100005000 
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5001      
=================================================================
Total params: 100,010,013
Trainable params: 100,010,013
Non-trainable params: 0
_________________________________________________________________
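For reference, the Dense-based variant behind this summary can be built roughly like this (a sketch reconstructed from the layer names in the summary; the function name dense_sample is mine):

def dense_sample(params):
    feat_dim = params['input_dim']
    M = params['agg_dim']

    input_s = Input(shape=(2, feat_dim, 1))

    # Same first convolution as before, then a plain Dense merge
    # instead of the LocallyConnected2D layer.
    output = Conv2D(M, (2, 1), name='AggConv1')(input_s)
    output = Flatten()(output)
    output = Dense(feat_dim, name='AggDense')(output)

    output = Dense(1)(output)
    model = Model(inputs=input_s, outputs=output)

    model.compile(optimizer='sgd', loss='mse')
    return model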

That Dense-based architecture trains just fine. Why is that? The allocation of 4033 Chunks of size 2560000 totalling 9.62GiB seems to be the problem: 2,560,000 bytes is exactly one float32 tensor of shape [32, 1, 5000, 4], i.e. the input to the LocallyConnected2D (the same shape as the OOM tensor above), and it is apparently being allocated roughly feat_dim times. Any clue as to what could fix this?
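One direction I am considering: a LocallyConnected2D with a (1, 1) kernel and a single filter is just a per-position weighted sum over the channels plus a per-position bias, so it could be implemented directly as a custom layer using a broadcasted multiply and a channel sum, without materializing per-position copies of the input. A minimal sketch for Keras 2.1.x with channels_last data (the class name PositionwiseDense is made up, and I have not verified that it reproduces LocallyConnected2D exactly):

from keras.engine.topology import Layer

class PositionwiseDense(Layer):
    # Per-position channel mix: same 5000 * (4 + 1) = 25,000 parameters as
    # LocallyConnected2D(1, (1, 1)), but computed with elementwise ops only.
    def build(self, input_shape):
        # input_shape is (batch, rows, cols, channels), e.g. (None, 1, 5000, 4)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(1,) + tuple(input_shape[1:]),
                                      initializer='glorot_uniform',
                                      trainable=True)
        self.bias = self.add_weight(name='bias',
                                    shape=(1,) + tuple(input_shape[1:3]) + (1,),
                                    initializer='zeros',
                                    trainable=True)
        super(PositionwiseDense, self).build(input_shape)

    def call(self, inputs):
        # Weighted sum over the channel axis plus a per-position bias.
        return K.sum(inputs * self.kernel, axis=-1, keepdims=True) + self.bias

    def compute_output_shape(self, input_shape):
        return input_shape[:3] + (1,)

In local_conv_sample this would replace the AggConv2 line, e.g. output = PositionwiseDense(name='AggConv2')(output).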

My setup: running on Python 2.7 with Keras 2.1.2 and a TensorFlow 1.4.0 backend.

You can play with the toy model by increasing/decreasing the batch size. Reducing the batch size does not work for me, because my feature dimension is large (~8k) and I cannot get by with batches of only 20 samples... I am surprised that a model with so few parameters takes up this much memory.
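When experimenting with the toy model, note that batch_size also has to be passed to fit explicitly (Keras defaults to 32 otherwise, which is why the OOM tensor has a leading dimension of 32 rather than 40). A minimal variation of the fit call above:

params['batch_size'] = 20
model.fit(np.random.rand(params['batch_size'], 2, params['input_dim'], 1),
          np.random.rand(params['batch_size']),
          batch_size=params['batch_size'])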

0 Answers:

No answers yet.