I have a simple CNN model trained on ImageNet, and I use keras.utils.multi_gpu_model for multi-GPU training. That works fine, but I run into a problem when trying to train an SSD model based on the same backbone. It has a custom loss function and several custom layers on top of the backbone:
model, predictor_sizes, input_encoder = build_model(input_shape=(args.img_height, args.img_width, 3),
                                                    n_classes=num_classes, mode='training')
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
loss = SSDMultiBoxLoss(neg_pos_ratio=3, alpha=1.0)
if args.num_gpus > 1:
    model = multi_gpu_model(model, gpus=args.num_gpus)
model.compile(optimizer=optimizer, loss=loss.compute_loss)
model.summary()
For num_gpus == 1, I get the following summary:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 512, 512, 3) 0
__________________________________________________________________________________________________
conv1_pad (Lambda) (None, 516, 516, 3) 0 input_1[0][0]
__________________________________________________________________________________________________
conv1 (Conv2D) (None, 256, 256, 16) 1216 conv1_pad[0][0]
__________________________________________________________________________________________________
conv1_bn (BatchNormalization) (None, 256, 256, 16) 64 conv1[0][0]
__________________________________________________________________________________________________
conv1_relu (Activation) (None, 256, 256, 16) 0 conv1_bn[0][0]
__________________________________________________________________________________________________
....
det_ctx6_2_mbox_loc_reshape[0][0]
__________________________________________________________________________________________________
mbox_priorbox (Concatenate) (None, None, 8) 0 det_ctx1_2_mbox_priorbox_reshape[
det_ctx2_2_mbox_priorbox_reshape[
det_ctx3_2_mbox_priorbox_reshape[
det_ctx4_2_mbox_priorbox_reshape[
det_ctx5_2_mbox_priorbox_reshape[
det_ctx6_2_mbox_priorbox_reshape[
__________________________________________________________________________________________________
mbox (Concatenate) (None, None, 33) 0 mbox_conf_softmax[0][0]
mbox_loc[0][0]
mbox_priorbox[0][0]
==================================================================================================
Total params: 1,890,510
Trainable params: 1,888,366
Non-trainable params: 2,144
In the multi-GPU case, however, I can see that all the intermediate layers are collapsed under a single model layer:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 512, 512, 3) 0
__________________________________________________________________________________________________
lambda (Lambda) (None, 512, 512, 3) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 512, 512, 3) 0 input_1[0][0]
__________________________________________________________________________________________________
model (Model) (None, None, 33) 1890510 lambda[0][0]
lambda_1[0][0]
__________________________________________________________________________________________________
mbox (Concatenate) (None, None, 33) 0 model[1][0]
model[2][0]
==================================================================================================
Total params: 1,890,510
Trainable params: 1,888,366
Non-trainable params: 2,144
Training runs fine, but I cannot load the previously trained weights:
model.load_weights(args.weights, by_name=True)
which fails with the error:
ValueError: Layer #3 (named "model") expects 150 weight(s), but the saved weights have 68 element(s).
Naturally, the pretrained checkpoint contains only the backbone weights, while the rest of the object-detection model has none.
Can anyone help me understand what is going on here?
Note: I am using tf.keras, which is now part of TensorFlow.
Answer 0 (score: 0)
You can load the weights right after building the model, before converting it to its multi-GPU counterpart. Alternatively, keep two objects, the single-GPU and the multi-GPU version: load the weights through the first one and train with the second.
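A minimal sketch of this idea, using a tiny hypothetical stand-in for the SSD builder (the real backbone and detection heads are not reproduced here, and the multi_gpu_model call itself is left as a comment because it needs more than one GPU):

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Hypothetical stand-in for the SSD builder in the question;
    # the real model has a backbone plus detection heads.
    inp = tf.keras.Input(shape=(8,), name='input_1')
    x = tf.keras.layers.Dense(16, activation='relu', name='conv1')(inp)
    out = tf.keras.layers.Dense(4, name='mbox')(x)
    return tf.keras.Model(inp, out)

# Pretend this checkpoint holds the previously trained weights.
pretrained = build_model()
pretrained.save_weights('pretrained.weights.h5')

# Build the fresh model and load the checkpoint BEFORE any multi-GPU
# wrapping, while the layer names still match the saved file.
model = build_model()
model.load_weights('pretrained.weights.h5')

# Only after loading, wrap the model for parallel training (requires
# multiple GPUs, so it stays a comment in this sketch):
# parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=args.num_gpus)
# parallel_model.compile(optimizer=optimizer, loss=loss.compute_loss)

# The single-GPU template now carries the checkpoint's weights.
for got, want in zip(model.get_weights(), pretrained.get_weights()):
    assert np.array_equal(got, want)
```

Because the wrapped model shares the same weight tensors as the template, any weights loaded before wrapping are the weights the parallel replicas train.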
Answer 1 (score: 0)
When wrapping the model for multi-GPU training, assign the result of multi_gpu_model to a new variable, e.g. model_multiGPU, and after training save and load weights using the original model you passed into multi_gpu_model. That resolves the problem.
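A sketch of that two-variable pattern, again with a hypothetical miniature model in place of the real SSD builder; the wrapping step is shown as a comment (it needs multiple GPUs) and the sketch falls back to the template itself:

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Hypothetical stand-in for the SSD builder in the question.
    inp = tf.keras.Input(shape=(8,), name='input_1')
    out = tf.keras.layers.Dense(4, name='mbox')(inp)
    return tf.keras.Model(inp, out)

# Keep a handle on the single-GPU template...
model = build_model()

# ...and put the wrapped copy into a *separate* variable:
# model_multiGPU = tf.keras.utils.multi_gpu_model(model, gpus=args.num_gpus)
model_multiGPU = model  # single-GPU fallback so this sketch runs anywhere

# Train via model_multiGPU. Both objects share the same weight tensors,
# so checkpoints are saved and restored through the template, whose layer
# names match a single-GPU summary:
model.save_weights('ssd.weights.h5')

restored = build_model()
restored.load_weights('ssd.weights.h5')
for got, want in zip(restored.get_weights(), model.get_weights()):
    assert np.array_equal(got, want)
```

Saving through the template avoids the nested "model" layer that appears in the multi-GPU summary, which is exactly what made the by-name weight loading fail in the question.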