CUDA out of memory when training a ConvLSTM2D model

Asked: 2018-03-28 17:02:22

Tags: tensorflow keras conv-neural-network lstm

I am trying to train a 2D convolutional LSTM using Keras and TensorFlow-GPU. The model compiles, but soon after training starts it runs into an out-of-memory error.

The model takes input of shape (batch_size, timesteps, 135, 240, 1), where batch_size is the number of videos and timesteps is the number of frames per video. I have locked batch_size to 1 (one video at a time), but timesteps can vary between 600 and 4,800 frames depending on the length of the video.

The labels have shape (batch_size, timesteps, 9), where 9 is the number of classes the model predicts.
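
For scale, here is a quick back-of-envelope check (assuming float32 values) of the raw input tensor for a worst-case 4,800-frame video:

# Size of a single worst-case input batch, assuming float32 frames
timesteps = 4800
pixels_per_frame = 135 * 240 * 1
input_bytes = 1 * timesteps * pixels_per_frame * 4   # batch_size=1, 4 bytes/value
print(input_bytes / 2**20)   # ~593 MiB for the input tensor alone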

Model summary

Layer (type)                 Output Shape              Param #
=================================================================
conv_lst_m2d_1 (ConvLSTM2D)  (None, None, 135, 240, 40) 59200
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 135, 240, 40) 160
_________________________________________________________________
average_pooling3d_1 (Average (None, None, 1, 1, 40)    0
_________________________________________________________________
reshape_1 (Reshape)          (None, None, 40)          0
_________________________________________________________________
dense_1 (Dense)              (None, None, 9)           369
=================================================================
Total params: 59,729
Trainable params: 59,649
Non-trainable params: 80
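
A definition along these lines reproduces that summary (a minimal sketch: the 3x3 kernel and full-frame pooling are inferred from the parameter counts, and the activation, loss, and optimizer are placeholders):

from keras.models import Sequential
from keras.layers import ConvLSTM2D, BatchNormalization, AveragePooling3D, Reshape, Dense

model = Sequential()
# 40 filters with a 3x3 kernel over 1 input channel gives the 59,200 params above
model.add(ConvLSTM2D(filters=40, kernel_size=(3, 3), padding='same',
                     return_sequences=True,
                     input_shape=(None, 135, 240, 1)))  # variable timesteps
model.add(BatchNormalization())
# Average each frame down to a single 1x1 spatial cell per filter
model.add(AveragePooling3D(pool_size=(1, 135, 240)))
model.add(Reshape((-1, 40)))   # (timesteps, 40)
model.add(Dense(9, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')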

Device placement log

2018-03-28 11:40:16.994858: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-03-28 11:40:17.254698: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.02GiB
2018-03-28 11:40:17.260611: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-28 11:40:17.520790: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4790 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2
2018-03-28 11:40:17.975718: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2

Terminal output

If I let it run, the training session continually dumps lines like this to STDOUT:

2018-03-28 11:45:29.748269: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 000000056FD23200 of size 5184000

Occasionally it also dumps lines like these:

2018-03-28 11:45:30.203961: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:967] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-03-28 11:45:30.209571: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 17179869184

[Screenshot of STDOUT]

I am wondering whether the model is too complex, whether it has too many layers for the GPU, or whether the input data for a single batch is simply too large. I have tried training on a GTX 970M with 6 GB of VRAM (shown above) and on a GTX 980 with 4 GB. A colleague has also tried running it on a GTX 1080 with 8 GB. The error occurs on all three cards.

Edit 3/28/2018 13:20

I should clarify some additional details. I am training the model with fit_generator, passing in a custom subclass of keras.utils.Sequence. In case it is relevant, here is the source for my subclass:

import numpy as np
from keras.utils import Sequence


class ROASequence(Sequence):
    """Serves one batch of (video, label) arrays per __getitem__ call."""

    def __init__(self, x_set, y_set, batch_size):
        self.x = x_set          # list of paths to video data
        self.y = y_set          # list of paths to label data
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Select the paths for this batch and load each sample from disk;
        # unpack_sample is my own helper that loads one (x, y) pair
        x_paths = self.x[idx * self.batch_size: (idx + 1) * self.batch_size]
        y_paths = self.y[idx * self.batch_size: (idx + 1) * self.batch_size]
        batch_x = []
        batch_y = []
        for xpath, ypath in zip(x_paths, y_paths):
            sample_x, sample_y = unpack_sample(xpath, ypath)
            batch_x.append(sample_x)
            batch_y.append(sample_y)
        batch_x = np.array(batch_x)
        batch_y = np.array(batch_y)
        print(batch_x.shape, batch_y.shape)
        return batch_x, batch_y
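
Training is then kicked off roughly like this (the file lists and epoch count here are placeholders, not my real configuration):

# Hypothetical file lists; batch_size=1 means one video per batch
train_x_paths = ['video_000.npy', 'video_001.npy']
train_y_paths = ['labels_000.npy', 'labels_001.npy']
train_seq = ROASequence(train_x_paths, train_y_paths, batch_size=1)
model.fit_generator(train_seq, epochs=10)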

Edit 3/28/2018 14:00

In the logs copied above, the CUDA_ERROR_OUT_OF_MEMORY errors come with warnings such as "failed to alloc 17179869184 bytes." That is more than 17 GB (exactly 16 GiB). What could be the main contributor to such an enormous memory requirement? Again, my input shapes are (batch_size, timesteps, 135, 240, 1) and (batch_size, timesteps, 9), where batch_size is set to 1 and timesteps is capped at 4,800. I am not sure how this scales in terms of memory requirements, or how the ConvLSTM2D model affects it. Is something about my Sequence subclass implementation and my use of fit_generator causing the model to load more than one video at a time?
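
For what it's worth, a rough estimate of the activations alone (my own arithmetic, assuming float32 and that backpropagation through time has to keep the ConvLSTM2D output for every timestep):

# ConvLSTM2D output retained across all timesteps for backprop, float32
timesteps = 4800
activations_per_frame = 135 * 240 * 40   # 40 filters at full resolution
bytes_total = timesteps * activations_per_frame * 4
print(bytes_total / 2**30)   # ~23 GiB, before cell states and gradients

That is in the same ballpark as the failed 16 GiB allocation.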

1 Answer:

Answer 0 (score: 0)

It looks like the problem was an overly ambitious architecture. I reduced the model's complexity considerably, so it now looks like this:

Layer (type)                 Output Shape              Param #
=================================================================
conv_lst_m2d_1 (ConvLSTM2D)  (None, None, 135, 240, 4) 736
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 135, 240, 4) 16
_________________________________________________________________
average_pooling3d_1 (Average (None, None, 1, 1, 4)     0
_________________________________________________________________
reshape_1 (Reshape)          (None, None, 4)           0
_________________________________________________________________
dense_1 (Dense)              (None, None, 9)           45
=================================================================
Total params: 797
Trainable params: 789
Non-trainable params: 8

You'll see that I changed the ConvLSTM2D filters value from 40 to 4.

In addition, I clipped the video data to 100 frames per video, so my input shapes are now (batch_size, 100, 135, 240, 1) and (batch_size, 100, 9).
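
The clipping itself is just a slice inside __getitem__ (a sketch of the change, using the loop from my Sequence above):

# Keep only the first 100 frames of each sample before batching
sample_x, sample_y = unpack_sample(xpath, ypath)
sample_x = sample_x[:100]   # frames are the leading axis of a sample
sample_y = sample_y[:100]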

With those two changes in place, the model now compiles and trains without running out of memory. My next step will be to figure out how to modify my Sequence so that it splits each video into 100-frame segments, so I don't just throw away everything after the first 100 frames (a rough sketch of that idea follows below). I may also downscale my frames with NumPy, and there are some low-contrast frames I could cull, but those issues are beyond the scope of this question.
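
A sketch of that segmenting idea (untested; assumes frames are the leading axis of each loaded array):

# Untested sketch: split one loaded video into consecutive 100-frame
# segments so frames after the first 100 are no longer discarded
def split_into_segments(sample_x, sample_y, seg_len=100):
    segments = []
    for start in range(0, len(sample_x), seg_len):
        seg_x = sample_x[start:start + seg_len]
        seg_y = sample_y[start:start + seg_len]
        if len(seg_x) == seg_len:   # drop a short trailing remainder
            segments.append((seg_x, seg_y))
    return segments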