Keras with TensorFlow backend --- MemoryError in model.fit() with a checkpoint callback

Date: 2019-02-01 07:24:56

Tags: python tensorflow keras

I am trying to train an autoencoder. Keras keeps raising a MemoryError from model.fit(), and it always happens when I add validation-related arguments (e.g. validation_split) to model.fit.

Error:

Traceback (most recent call last):
  File "/root/abnormal-spatiotemporal-ae/start_train.py", line 53, in <module>
    train(dataset=dataset, job_folder=job_folder, logger=logger)
  File "/root/abnormal-spatiotemporal-ae/classifier.py", line 109, in train
    callbacks=[snapshot, earlystop, history_log]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py",
     

第990行,合适           y,val_y =(slice_arrays(y,0,split_at),         文件“ /root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py”,   slice_arrays中的第528行           返回[如果x为其他则为None,否则x [start:stop]对于数组中的x]         文件“ /root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py”,   528行,在           返回[如果x为其他则为None,否则x [start:stop]对于数组中的x]         文件“ /root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/io_utils.py”,   第110行,在 getitem           返回self.data [idx]         在h5py._objects.with_phil.wrapper中的文件“ h5py / _objects.pyx”第54行         在h5py._objects.with_phil.wrapper中的文件“ h5py / _objects.pyx”,第55行         文件“ /root/anaconda3/envs/py35/lib/python3.5/site-packages/h5py/_hl/dataset.py”,   第485行,位于 getitem           arr = numpy.ndarray(mshape,new_dtype,order ='C')       MemoryError

Code:

import os
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils import HDF5Matrix

# HDF5Matrix is a lazily-backed view over the HDF5 dataset; data is read
# from disk on access. (LossHistory is a custom callback defined elsewhere.)
data = HDF5Matrix(os.path.join(video_root_path, '{0}/{0}_train_t{1}.h5'.format(dataset, time_length)),
                  'data')

snapshot = ModelCheckpoint(os.path.join(job_folder,
           'model_snapshot_e{epoch:03d}_{val_loss:.6f}.h5'))
earlystop = EarlyStopping(patience=10)
history_log = LossHistory(job_folder=job_folder, logger=logger)

logger.info("Initializing training...")

history = model.fit(
    data,
    data,
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_split=0.15,
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)

When I remove validation_split=0.15 from model.fit and remove snapshot from the callbacks, the code runs correctly.
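
For context, this is roughly what validation_split does internally according to the traceback above: Keras slices the inputs at the split point, and slicing an HDF5Matrix goes through h5py, which materializes the entire slice as an in-memory NumPy array. A minimal sketch of that code path (the file and dataset names here are hypothetical):

from keras.utils import HDF5Matrix

# 'train.h5' and 'data' are hypothetical stand-ins for the real file/dataset.
data = HDF5Matrix('train.h5', 'data')
split_at = int(len(data) * (1.0 - 0.15))  # validation_split=0.15
x_train = data[0:split_at]                # h5py reads this whole slice into RAM
x_val = data[split_at:len(data)]          # and this one too -> MemoryError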

The data variable contains all processed images of the training dataset; its shape is (15200, 8, 224, 224, 1), i.e. 6,101,401,600 elements. The code runs on a machine with 64GB of RAM and a Tesla P100, so memory space should not be a concern, and my Python is 64-bit.
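
For reference, a quick back-of-the-envelope check of that figure (assuming float32 storage, which is typical for Keras inputs but not stated in the question):

import numpy as np

shape = (15200, 8, 224, 224, 1)
n_elements = np.prod(shape, dtype=np.int64)          # 6,101,401,600 elements
n_bytes = n_elements * np.dtype('float32').itemsize  # 4 bytes per element
print('{:,} elements -> {:.1f} GiB'.format(n_elements, n_bytes / 1024**3))
# 6,101,401,600 elements -> 22.7 GiB

Note that data is passed as both x and y in model.fit, so the split shown in the traceback slices it twice; two full in-memory copies (~45 GiB) plus overhead can plausibly exhaust 64GB of RAM.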

Model:

from keras.layers import (Activation, BatchNormalization, Conv2D,
                          Conv2DTranspose, ConvLSTM2D, Input, TimeDistributed)

input_tensor = Input(shape=(t, 224, 224, 1))

# Spatial encoder: two strided convolutions applied frame-by-frame
conv1 = TimeDistributed(Conv2D(128, kernel_size=(11, 11), padding='same', strides=(4, 4), name='conv1'),
                        input_shape=(t, 224, 224, 1))(input_tensor)
conv1 = TimeDistributed(BatchNormalization())(conv1)
conv1 = TimeDistributed(Activation('relu'))(conv1)

conv2 = TimeDistributed(Conv2D(64, kernel_size=(5, 5), padding='same', strides=(2, 2), name='conv2'))(conv1)
conv2 = TimeDistributed(BatchNormalization())(conv2)
conv2 = TimeDistributed(Activation('relu'))(conv2)

# Temporal bottleneck: stacked convolutional LSTMs
convlstm1 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm1')(conv2)
convlstm2 = ConvLSTM2D(32, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm2')(convlstm1)
convlstm3 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm3')(convlstm2)

# Spatial decoder: transposed convolutions back to the input resolution
deconv1 = TimeDistributed(Conv2DTranspose(128, kernel_size=(5, 5), padding='same', strides=(2, 2), name='deconv1'))(convlstm3)
deconv1 = TimeDistributed(BatchNormalization())(deconv1)
deconv1 = TimeDistributed(Activation('relu'))(deconv1)

decoded = TimeDistributed(Conv2DTranspose(1, kernel_size=(11, 11), padding='same', strides=(4, 4), name='deconv2'))(deconv1)
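
The question does not show how this graph becomes the model that model.fit is called on; presumably something along these lines (the optimizer and loss here are assumptions, not taken from the question):

from keras.models import Model

# Hypothetical assembly/compile step for the autoencoder above.
model = Model(inputs=input_tensor, outputs=decoded)
model.compile(optimizer='adam', loss='mse')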

1 Answer:

Answer 0 (score: 0)

This question ran into the same problem. The explanation given there is that there are too many data points before the flatten layer, which overflows RAM. The problem can be solved by adding additional convolutional layers.
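
The model above has no flatten layer, but the same idea applies to its activation volume: an extra strided convolution in the encoder shrinks the feature maps that every later layer has to hold. A minimal sketch of that suggestion (the layer size and name are assumptions):

from keras.layers import Conv2D, TimeDistributed

# Hypothetical extra downsampling layer inserted after conv2; stride 2 halves
# each spatial dimension, quartering the activations kept for later layers.
conv3 = TimeDistributed(Conv2D(64, kernel_size=(3, 3), padding='same',
                               strides=(2, 2), name='conv3'))(conv2)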