I am trying to train MTCNN's PNet from TFRecords. At first the loss decreases smoothly for the first few epochs, then it turns into nan, and so do the model weights.
Below are my model structure and the training results:
from tensorflow.keras.layers import (Input, Conv2D, PReLU, MaxPooling2D,
                                     Reshape, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

def pnet_train1(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')
    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!
    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu2')(M)
    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu3')(M)
    # For a 12x12 input the feature map is 1x1 at this point, so the Reshapes below are valid.
    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)
    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
    model = Model(X, Pnet_output)
    return model
model = pnet_train1(True)
model.compile(optimizer = Adam(lr = 0.001), loss = custom_loss)
model.fit(ds, steps_per_epoch = 1636, epochs = 100, validation_data = ds, validation_steps = 1636)
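(custom_loss is my own multi-task loss and is not reproduced here; in spirit it is an mse-style joint loss over the concatenated classifier, box, and landmark outputs. A simplified stand-in, purely to show the shape of the thing and not my exact masking logic:)

import tensorflow as tf

def custom_loss(y_true, y_pred):
    # Illustrative stand-in: plain mse over the concatenated
    # [classifier(1) | bbox(4) | landmark(12)] output vector.
    return tf.reduce_mean(tf.square(y_true - y_pred), axis = -1)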
[screenshot: pnet training records — the loss drops for the first few epochs, then turns to nan]
I know there are several possible causes, so I have already run the following checks. First, I pulled a fixed set of samples (x, y) out of the dataset and trained on it directly:
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 74us/sample - loss: 0.1579
Epoch 2/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1574
Epoch 3/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1567
Epoch 4/10
1792/1792 [==============================] - 0s 65us/sample - loss: 0.1550
Epoch 5/10
1792/1792 [==============================] - 0s 61us/sample - loss: 0.1556
Epoch 6/10
1792/1792 [==============================] - 0s 70us/sample - loss: 0.1527
Epoch 7/10
1792/1792 [==============================] - 0s 71us/sample - loss: 0.1532
Epoch 8/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1509
Epoch 9/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1501
Epoch 10/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1495
Out[111]: <tensorflow.python.keras.callbacks.History at 0x1f767efa088>
# Snapshot the current (still finite) weights, layer by layer:
temp_weights_list = []
for layer in model.layers:
    temp_layer = model.get_layer(layer.name)
    temp_weights = temp_layer.get_weights()
    temp_weights_list.append(temp_weights)
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 70us/sample - loss: nan
Epoch 2/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 3/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 4/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Then I rebuilt the model structure and loaded the healthy weights saved before the loss went wrong, and the nan came back.
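The reload step looked roughly like this (a minimal sketch; it assumes the temp_weights_list snapshot from above and that the rebuilt model has the same layer order):

# Rebuild a fresh PNet and restore the snapshotted (still finite) weights.
model = pnet_train1(True)
for layer, weights in zip(model.layers, temp_weights_list):
    if weights:  # skip weight-less layers (Input, Reshape, Concatenate, ...)
        layer.set_weights(weights)
model.compile(optimizer = Adam(lr = 0.001), loss = custom_loss)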
Then I drew a different batch of samples from the dataset, and the same thing happened.
So for now I believe the dataset itself is fine (I used img_string = open(img_path, 'rb').read()
to generate the TFRecords, and that read raises an error if an image is corrupted or an image path is invalid).
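For context, the TFRecord writing step was essentially the following (a sketch; the feature names and label packing here are assumptions, not my exact script):

import tensorflow as tf

def serialize_example(img_path, label_vector):
    # Reading the raw bytes fails loudly if the path is wrong or the file
    # is unreadable, which is why I trust the record-generation step.
    img_string = open(img_path, 'rb').read()
    feature = {
        'image': tf.train.Feature(bytes_list = tf.train.BytesList(value = [img_string])),
        'label': tf.train.Feature(float_list = tf.train.FloatList(value = label_vector)),
    }
    return tf.train.Example(features = tf.train.Features(feature = feature)).SerializeToString()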
I also tried adding BatchNormalization after every convolution (turning off the conv biases accordingly):

from tensorflow.keras.layers import BatchNormalization

def pnet_train2(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')
    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!
    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu2')(M)
    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu3')(M)
    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)
    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
    model = Model(X, Pnet_output)
    return model
Then I changed the initial learning rate from 0.001 (the Keras Adam default) to 0.0001. The same thing kept happening; the process just slowed down because of the lower learning rate.
Thinking further: float32 can represent magnitudes up to roughly 3.4 × 10^38, and the trained weights were all around 10^-2 before the failure, so my first guess was that some denominator in the loss goes to ~0. But with mse there is no division at all, so I am still confused about the real cause.
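One way mse can still produce nan without any division: once an activation explodes, the squared error overflows float32 to inf, and operations like inf - inf then yield nan. A quick numpy illustration of the mechanism (not my actual data):

import numpy as np

y_true = np.float32(0.5)
y_pred = np.float32(1e20)                # an exploded activation
err = y_pred - y_true                    # still finite, ~1e20
print(np.square(err))                    # 1e40 overflows float32 -> inf
print(np.square(err) - np.square(err))   # inf - inf -> nan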
I am really not sure whether this is a mistake on my side or a bug somewhere else, so any suggestions are appreciated. Thanks in advance.
Update on 2020.04.10 01:57:
I tried to pinpoint the epoch where the loss goes wrong. The last epoch that still reports a normal loss already updates the model weights to nan. So I think the problem happens during backpropagation (normal loss (0.0814) -> backprop -> model weights become nan -> next loss is nan). I used the following code to grab the model weights:
model.fit(x, y, batch_size = 1792, epochs = 1)
Train on 1792 samples
1792/1792 [==============================] - 0s 223us/sample - loss: 0.0814
Out[36]: <tensorflow.python.keras.callbacks.History at 0x205ff7b0188>
temp_weights_list1 = []
for layer in model.layers:
    temp_layer = model.get_layer(layer.name)
    temp_weights = temp_layer.get_weights()
    temp_weights_list1.append(temp_weights)
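To confirm where the nan first appears, I scan the snapshot for non-finite values (a small helper, assuming the temp_weights_list1 from above):

import numpy as np

# Report every layer whose snapshotted weights contain nan or inf.
for layer, weights in zip(model.layers, temp_weights_list1):
    for w in weights:
        if not np.all(np.isfinite(w)):
            print(layer.name, w.shape, 'contains nan/inf')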