This is the training code I'm trying to run on a machine with 64 GB of RAM and an RTX 2070:
# TF 1.x: cap this process at 70% of GPU memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
tf.keras.backend.set_session(tf.Session(config=config))
model = efn.EfficientNetB7()
model.summary()
# create new output layer
output_layer = Dense(5, activation='sigmoid', name="retrain_output")(model.get_layer('top_dropout').output)
new_model = Model(inputs=model.input, outputs=output_layer)
new_model.summary()
# lock previous weights
for i, l in enumerate(new_model.layers):
    if i < 228:
        l.trainable = False
# lock probs weights
new_model.compile(loss='mean_squared_error', optimizer='adam')
batch_size = 5
samples_per_epoch = 30
epochs = 20
# generate train data
train_datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0)
train_generator = train_datagen.flow_from_directory(
    train_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019,
    subset='training')
validation_generator = train_datagen.flow_from_directory(
    validation_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019,
    subset='validation')
new_model.fit_generator(
    train_generator,
    samples_per_epoch=samples_per_epoch,
    epochs=epochs,
    validation_steps=20,
    validation_data=validation_generator,
    nb_worker=24)
new_model.save(model_output_path)
Exception:
2019-11-17 08:52:52.903583: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
...
2019-11-17 08:53:24.713020: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 110 Chunks of size 27724800 totalling 2.84GiB
2019-11-17 08:53:24.713024: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 6 Chunks of size 38814720 totalling 222.10MiB
2019-11-17 08:53:24.713027: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 23 Chunks of size 54000128 totalling 1.16GiB
2019-11-17 08:53:24.713031: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 73760000 totalling 70.34MiB
2019-11-17 08:53:24.713034: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.45GiB
2019-11-17 08:53:24.713040: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: Limit: 5856749158 InUse: 5848048896 MaxInUse: 5848061440 NumAllocs: 6140 MaxAllocSize: 3259170816

2019-11-17 08:53:24.713214: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-11-17 08:53:24.713232: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/naort/Desktop/deep-learning-data-preparation-tools/EfficientNet-Transfer-Learning-BoilerPlate/model_retrain.py", line 76, in <module>
    nb_worker=24)
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node training/Adam/gradients/AddN_387-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  [[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Answer (score: 2)
Although the EfficientNet models have lower parameter counts than comparable ResNe(X)t models, they still consume a lot of GPU memory. What you're seeing is an out-of-memory error on your GPU (8 GB on an RTX 2070), not in system RAM (64 GB).
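As a sanity check, the numbers in the allocator dump are consistent with very large intermediate activations. The tensor that failed to allocate has shape [5, 1344, 38, 38] with 4-byte floats, which matches the 38814720-byte chunks the BFC allocator reports (a rough back-of-the-envelope calculation, not part of the original answer):

```python
# Size of the activation tensor named in the OOM message:
# shape [5, 1344, 38, 38], dtype float (4 bytes per element)
batch, channels, height, width = 5, 1344, 38, 38
tensor_bytes = batch * channels * height * width * 4

print(tensor_bytes)                              # 38814720, the chunk size in the log
print(round(tensor_bytes / 2**20, 2), "MiB")     # ~37 MiB per tensor

# Six such chunks, as the allocator reports, account for ~222 MiB alone
print(round(6 * tensor_bytes / 2**20, 2), "MiB")
```

With a batch size of only 5, hundreds of activations of this magnitude add up to the ~5.45 GiB of in-use chunks the log shows, which is why the 8 GB card runs out.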
The B7 model, especially at full resolution, is beyond what you'd want to train on a single RTX 2070, even with most of the layers frozen.
Running the model in FP16 may help, and it would also make use of the TensorCores on your RTX card. Following https://medium.com/@noel_kennedy/how-to-use-half-precision-float16-when-training-on-rtx-cards-with-tensorflow-keras-d4033d59f9e4, try this:
import keras.backend as K
dtype='float16'
K.set_floatx(dtype)
# default is 1e-7 which is too small for float16. Without adjusting the epsilon, we will get NaN predictions because of divide by zero problems
K.set_epsilon(1e-4)
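The epsilon adjustment matters because float16 has only about three decimal digits of precision and a maximum value of 65504, so the default 1e-7 is effectively invisible in sums and overflows in divisions. A quick sketch (using numpy scalars as a stand-in for the backend's float16 arithmetic, not the original answer's code) shows why:

```python
import numpy as np

# In float16, adding the default epsilon of 1e-7 to 1.0 changes nothing:
# the value is far below float16's precision at that magnitude.
one = np.float16(1.0)
assert one + np.float16(1e-7) == one

# Worse, dividing by it overflows: 1 / 1e-7 is about 1e7, far above
# float16's maximum of 65504, so expressions like x / (norm + epsilon)
# become inf when norm is 0, and inf then propagates into NaN losses.
print(np.float16(1.0) / np.float16(1e-7))   # inf

# With epsilon = 1e-4 the quotient stays finite (roughly 1e4):
print(np.float16(1.0) / np.float16(1e-4))
```

This is exactly the "NaN predictions because of divide by zero" failure mode the comment in the snippet above warns about.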