Transfer learning - trying to retrain EfficientNet-B7 on an RTX 2070 and running out of memory

Date: 2019-11-17 06:59:43

Tags: python tensorflow keras deep-learning efficientnet

This is the training code I am trying to run on a machine with 64GB of RAM and an RTX 2070:

import tensorflow as tf
import efficientnet.keras as efn  # qubvel/efficientnet package
from keras.layers import Dense
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

# cap TensorFlow at 70% of the GPU's memory (TF1-style session config)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
tf.keras.backend.set_session(tf.Session(config=config))

# load EfficientNet-B7 with its pre-trained ImageNet weights
model = efn.EfficientNetB7()
model.summary()

# create a new 5-unit output head on top of the pre-trained backbone
output_layer = Dense(5, activation='sigmoid', name="retrain_output")(model.get_layer('top_dropout').output)
new_model = Model(model.input, outputs=output_layer)
new_model.summary()
# lock the pre-trained weights so only the new head (and topmost layers) train

for i, l in enumerate(new_model.layers):
    if i < 228:
        l.trainable = False

new_model.compile(loss='mean_squared_error', optimizer='adam')

batch_size = 5
samples_per_epoch = 30
epochs = 20

# generate train data
# generate train data; train and validation images live in separate folders,
# so ImageDataGenerator's validation_split/subset mechanism is not needed
# (with validation_split=0, subset='validation' would yield zero images)
train_datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_generator = train_datagen.flow_from_directory(
    train_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019)

validation_generator = train_datagen.flow_from_directory(
    validation_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019)

new_model.fit_generator(
    train_generator,
    steps_per_epoch=samples_per_epoch // batch_size,  # Keras 2 counts batches, not samples
    epochs=epochs,
    validation_steps=20,
    validation_data=validation_generator,
    workers=24)  # 'nb_worker' is the deprecated Keras 1 spelling

new_model.save(model_output_path)



Exception:

2019-11-17 08:52:52.903583: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
.......
2019-11-17 08:53:24.713020: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 110 Chunks of size 27724800 totalling 2.84GiB
2019-11-17 08:53:24.713024: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 6 Chunks of size 38814720 totalling 222.10MiB
2019-11-17 08:53:24.713027: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 23 Chunks of size 54000128 totalling 1.16GiB
2019-11-17 08:53:24.713031: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 73760000 totalling 70.34MiB
2019-11-17 08:53:24.713034: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.45GiB
2019-11-17 08:53:24.713040: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: Limit: 5856749158 InUse: 5848048896 MaxInUse: 5848061440 NumAllocs: 6140 MaxAllocSize: 3259170816

2019-11-17 08:53:24.713214: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-11-17 08:53:24.713232: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/naort/Desktop/deep-learning-data-preparation-tools/EfficientNet-Transfer-Learning-BoilerPlate/model_retrain.py", line 76, in <module>
    nb_worker=24)
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node training/Adam/gradients/AddN_387-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

1 Answer:

Answer 0 (score: 2):

Despite the EfficientNet models having lower parameter counts than comparable ResNe(X)t models, they still consume a lot of GPU memory. What you are seeing is an out-of-memory error on your GPU (8GB on the RTX 2070), not on the system (64GB).
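
As a first diagnostic step, letting TensorFlow grow its GPU allocation on demand (instead of pre-reserving 70% as in the question) makes it easier to see how much memory the model actually needs. A minimal TF1-style sketch:

import tensorflow as tf

# allocate GPU memory on demand rather than reserving a fixed fraction up front;
# this won't cure a genuine OOM, but nvidia-smi will then show real usage
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.keras.backend.set_session(tf.Session(config=config))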

A B7 model, especially at full resolution, is beyond what you want to train with a single RTX 2070 card, even if you freeze a large number of layers.
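
If you can trade some accuracy, a smaller EfficientNet variant fits far more comfortably in 8GB. A sketch of the same transfer-learning setup on B3 (the 300x300 input size and the GlobalAveragePooling2D head are assumptions, not part of the original code):

import efficientnet.keras as efn
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# B3 has only a fraction of B7's parameters and much smaller activations
base = efn.EfficientNetB3(weights='imagenet', include_top=False,
                          input_shape=(300, 300, 3))
x = GlobalAveragePooling2D()(base.output)
output_layer = Dense(5, activation='sigmoid', name="retrain_output")(x)
small_model = Model(base.input, outputs=output_layer)
small_model.compile(loss='mean_squared_error', optimizer='adam')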

Running the model in FP16 may help, and it will also make use of the TensorCores of your RTX card. Following https://medium.com/@noel_kennedy/how-to-use-half-precision-float16-when-training-on-rtx-cards-with-tensorflow-keras-d4033d59f9e4, try this:

import keras.backend as K

dtype='float16'
K.set_floatx(dtype)

# default is 1e-7 which is too small for float16.  Without adjusting the epsilon, we will get NaN predictions because of divide by zero problems
K.set_epsilon(1e-4)
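
Note that set_floatx must run before the model is constructed, so that the layers and weights are created in float16. A sketch of the rest, reusing the B7 head from the question:

import efficientnet.keras as efn
from keras.layers import Dense
from keras.models import Model

# build the model only after switching the backend dtype to float16
model = efn.EfficientNetB7()
output_layer = Dense(5, activation='sigmoid', name="retrain_output")(model.get_layer('top_dropout').output)
new_model = Model(model.input, outputs=output_layer)
new_model.compile(loss='mean_squared_error', optimizer='adam')

Pure float16 training can still be numerically fragile; the epsilon change above removes the most common divide-by-zero, but if the loss still turns NaN, proper mixed precision with loss scaling (available in newer TensorFlow releases) is the more robust option.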