I am trying to train a 2D convolutional LSTM with Keras and TensorFlow-GPU. The model compiles and begins training, but quickly hits an out-of-memory error.
The model input has the shape (batch_size, timesteps, 135, 240, 1), where batch_size is the number of videos and timesteps is the number of frames per video. I have locked batch_size at 1 (one video at a time), but timesteps can vary between 600 and 4,800 frames depending on the length of the video.
The labels have the shape (batch_size, timesteps, 9), where 9 is the number of classes the model predicts.
Model summary
Layer (type) Output Shape Param #
=================================================================
conv_lst_m2d_1 (ConvLSTM2D)  (None, None, 135, 240, 40)  59200
_________________________________________________________________
batch_normalization_1 (BatchNormalization)  (None, None, 135, 240, 40)  160
_________________________________________________________________
average_pooling3d_1 (AveragePooling3D)  (None, None, 1, 1, 40)  0
_________________________________________________________________
reshape_1 (Reshape) (None, None, 40) 0
_________________________________________________________________
dense_1 (Dense) (None, None, 9) 369
=================================================================
Total params: 59,729
Trainable params: 59,649
Non-trainable params: 80
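As a sanity check, the parameter counts in the summary can be reproduced by hand. The 3x3 kernel below is my assumption; it is the only kernel size that yields 59,200 parameters for a ConvLSTM2D with 1 input channel and 40 filters:

```python
# Hypothetical reconstruction of the parameter counts in the summary above.

def conv_lstm2d_params(filters, kernel_hw, in_channels):
    # Each of the 4 LSTM gates convolves over [input; hidden] and adds a bias.
    k = kernel_hw[0] * kernel_hw[1]
    return 4 * filters * (k * (in_channels + filters) + 1)

def dense_params(units, in_features):
    return units * in_features + units

conv = conv_lstm2d_params(40, (3, 3), 1)   # 59200
bn = 4 * 40                                # 160: gamma, beta, moving mean, moving var
dense = dense_params(9, 40)                # 369

print(conv, bn, dense, conv + bn + dense)  # 59200 160 369 59729
```

The non-trainable 80 parameters are the batch norm layer's moving mean and variance (2 x 40).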
Device placement log
2018-03-28 11:40:16.994858: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-03-28 11:40:17.254698: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.02GiB
2018-03-28 11:40:17.260611: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-28 11:40:17.520790: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4790 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2
2018-03-28 11:40:17.975718: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2
Terminal output
If I let it run, the training session keeps dumping lines to STDOUT in the following format:
2018-03-28 11:45:29.748269: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 000000056FD23200 of size 5184000
It occasionally dumps lines like this:
2018-03-28 11:45:30.203961: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:967] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-03-28 11:45:30.209571: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 17179869184
I am wondering whether the model is too complex or has too many layers for the GPU, or whether the input data for a single batch is simply too large. I have tried training on a GTX 970M with 6 GB of VRAM (shown above) and on a GTX 980 with 4 GB. A colleague has also tried running it on a GTX 1080 with 8 GB. The error occurs on all three cards.
Edit 3/28/2018 13:20
I should clarify some additional details. I am training the model with fit_generator, to which I pass a custom subclass of keras.utils.Sequence. In case it is relevant, here is the source of my subclass:
import numpy as np
from keras.utils import Sequence

class ROASequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x = x_set          # list of paths to video files
        self.y = y_set          # list of paths to the matching label files
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        x_paths = self.x[idx * self.batch_size: (idx + 1) * self.batch_size]
        y_paths = self.y[idx * self.batch_size: (idx + 1) * self.batch_size]
        batch_x = []
        batch_y = []
        for xpath, ypath in zip(x_paths, y_paths):
            # unpack_sample is defined elsewhere; it loads one video and its labels
            sample_x, sample_y = unpack_sample(xpath, ypath)
            batch_x.append(sample_x)
            batch_y.append(sample_y)
        batch_x = np.array(batch_x)
        batch_y = np.array(batch_y)
        print(batch_x.shape, batch_y.shape)
        return batch_x, batch_y
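To check whether the generator itself could be loading several videos at once, the slicing in __getitem__ can be exercised in plain Python (the file names below are placeholders). With batch_size = 1, each index selects exactly one path:

```python
import numpy as np

# Mimics the slicing logic of ROASequence.__getitem__ with batch_size = 1.
x_set = ['vid_a.npy', 'vid_b.npy', 'vid_c.npy']
batch_size = 1

n_batches = int(np.ceil(len(x_set) / float(batch_size)))
batches = [x_set[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
print(n_batches, batches)  # 3 [['vid_a.npy'], ['vid_b.npy'], ['vid_c.npy']]
```

So the Sequence hands the model one video per batch; it does not by itself explain the multi-gigabyte allocations.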
Edit 3/28/2018 14:00
In the log reproduced above, the CUDA_OUT_OF_MEMORY errors come with warnings such as "failed to alloc 17179869184 bytes." That is more than 17 GB. What could be the main contributor to this enormous memory requirement? Again, my input shapes are (batch_size, time_steps, 135, 240, 1) and (batch_size, time_steps, 9), where batch_size is set to 1 and time_steps is capped at 4,800. I am not sure how this scales in terms of memory requirements, or how the ConvLSTM2D model affects it. Could my Sequence subclass implementation and my use of fit_generator be causing the model to load more than one video at a time?
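For scale, here is my rough float32 estimate of just the ConvLSTM2D output activations, assuming every timestep is kept for backpropagation through time. The per-frame size matches the 5,184,000-byte chunks in the allocator log above, and the failed 17,179,869,184-byte allocation is exactly 16 GiB (2^34 bytes):

```python
# Back-of-the-envelope activation memory for the ConvLSTM2D output,
# assuming float32 (4 bytes per value) and all timesteps retained.
frame_bytes = 135 * 240 * 40 * 4   # one timestep's (135, 240, 40) output
print(frame_bytes)                 # 5184000 -- the BFC chunk size in the log

worst_case = 4800 * frame_bytes    # every timestep of the longest video
print(round(worst_case / 2**30, 1))  # ~23.2 (GiB) for this single tensor
```

On that arithmetic, a 4,800-frame video cannot fit its ConvLSTM2D activations into 4-8 GB of VRAM regardless of batch size.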
Answer (score: 0)
It looks like the problem was an overly ambitious architecture. I reduced the model's complexity substantially, so it now looks like this:
Layer (type) Output Shape Param #
=================================================================
conv_lst_m2d_1 (ConvLSTM2D) (None, None, 135, 240, 4) 736
_________________________________________________________________
batch_normalization_1 (BatchNormalization)  (None, None, 135, 240, 4)  16
_________________________________________________________________
average_pooling3d_1 (AveragePooling3D)  (None, None, 1, 1, 4)  0
_________________________________________________________________
reshape_1 (Reshape) (None, None, 4) 0
_________________________________________________________________
dense_1 (Dense) (None, None, 9) 45
=================================================================
Total params: 797
Trainable params: 789
Non-trainable params: 8
You will see that I reduced the ConvLSTM2D filters value from 40 to 4.
In addition, I clipped the video data to 100 frames per video. So my input shapes are now (batch_size, 100, 135, 240, 1) and (batch_size, 100, 9).
With these two changes in place, the model now compiles and trains without running out of memory. My next step will be to figure out how to modify my Sequence so that it splits each video into 100-frame segments (so that I am not discarding everything after the first 100 frames). I may also downscale my frames with NumPy, and there are some low-contrast frames I could cull. But those issues are beyond the scope of this question.
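The segment-splitting step described above could be sketched like this (plain Python; build_chunk_index and get_chunk are hypothetical helpers, not part of the original code). The idea is to map each 100-frame segment of each video to its own sample index, so the Sequence can serve segments instead of whole videos:

```python
import numpy as np

CHUNK = 100  # frames per training sample

def build_chunk_index(video_lengths):
    """Map a flat sample index to (video_idx, start_frame) pairs,
    so each full 100-frame segment becomes its own batch."""
    index = []
    for vid, length in enumerate(video_lengths):
        for start in range(0, length, CHUNK):
            if length - start >= CHUNK:   # drop ragged tail segments
                index.append((vid, start))
    return index

def get_chunk(frames, start):
    """Slice one CHUNK-long segment out of a (timesteps, H, W, 1) array."""
    return frames[start:start + CHUNK]

# e.g. a 250-frame and a 600-frame video yield 2 + 6 = 8 segments
print(len(build_chunk_index([250, 600])))  # 8
```

__len__ would then return the length of this index, and __getitem__ would load the video for the chosen (video_idx, start_frame) pair and slice it with get_chunk.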