TensorFlow2 tf.keras: loss and model weights suddenly turn to 'nan' when training MTCNN PNet

Asked: 2020-04-09 07:27:22

Tags: tensorflow keras deep-learning gradient loss

I am trying to train the PNet of MTCNN with tfrecords. At first the loss decreased smoothly for a few epochs, then it turned into 'nan', and so did the model weights.

Below are my model structure and the training results:

from tensorflow.keras.layers import (Input, Conv2D, PReLU, MaxPooling2D,
                                     BatchNormalization, Reshape, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

def pnet_train1(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!! 

    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv) 
    if train_with_landmark: 
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor]) 
        model = Model(X, Pnet_output) 
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model

model = pnet_train1(True)
model.compile(optimizer = Adam(lr = 0.001), loss = custom_loss)
model.fit(ds, steps_per_epoch = 1636, epochs = 100, validation_data = ds, validation_steps = 1636)

[screenshot: pnet training records]
I know there could be several causes, so I have already run the following checks:

  1. Checked the dataset for bad data:
    My dataset is:
    X: images of shape (12, 12, 3);
    Y: labels of shape (17, ), the classification label, the 4 box regression coordinates and the 6 landmark points (12 coordinates) concatenated together.
    The label can be 1, -1, 0 or -2; only samples labeled 1 or 0 take part in the custom loss I wrote myself.
    The ROI and landmark coordinates all lie in [-1, 1].
    The image data is normalized to (x - 127.5) / 128 before being fed into the training pipeline.
    To verify whether the data was causing the 'nan' loss, I extracted one batch (e.g. 1792 samples) from the dataset as numpy arrays ((1792, 12, 12, 3), (1792, 17)). Training on this single batch alone still reproduces the problem. In the epochs before the loss turns to 'nan', the loss looks normal and all model weights lie within (-1, 1); they are all very small values:
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 74us/sample - loss: 0.1579
Epoch 2/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1574
Epoch 3/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1567
Epoch 4/10
1792/1792 [==============================] - 0s 65us/sample - loss: 0.1550
Epoch 5/10
1792/1792 [==============================] - 0s 61us/sample - loss: 0.1556
Epoch 6/10
1792/1792 [==============================] - 0s 70us/sample - loss: 0.1527
Epoch 7/10
1792/1792 [==============================] - 0s 71us/sample - loss: 0.1532
Epoch 8/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1509
Epoch 9/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1501
Epoch 10/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1495
Out[111]: <tensorflow.python.keras.callbacks.History at 0x1f767efa088>

# snapshot every layer's (still finite) weights for later inspection
temp_weights_list = []
for layer in model.layers:
    temp_weights_list.append(layer.get_weights())

model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 70us/sample - loss: nan
Epoch 2/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 3/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 4/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan

Then I rebuilt the model and loaded the healthy weights saved before the loss went wrong, and the 'nan' happened again.
Then I extracted a different batch of samples from the dataset, and the same thing happened.
So I now believe my dataset is fine (I generated the TFRecords with img_string = open(img_path, 'rb').read(), which would have raised an error for a corrupted image or an invalid image path).
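
For reference, a numpy sanity check along these lines (a sketch, not my exact code; x and y are the extracted arrays used in the fit calls above) can rule out non-finite values or out-of-range labels in the batch:

import numpy as np

# check the extracted batch against the ranges described in point 1
assert np.all(np.isfinite(x)) and np.all(np.isfinite(y))
assert set(np.unique(y[:, 0])) <= {1.0, 0.0, -1.0, -2.0}  # label column
assert np.all(np.abs(y[:, 1:]) <= 1.0)                    # roi / landmark coords
assert np.all(np.abs(x) <= 1.0)                           # after (x - 127.5) / 128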

  2. Since this usually points to exploding gradients, I tried:
    adding BatchNormalization layers,
    adding L2 regularization to the weights,
    using Xavier (Glorot) initialization,
    and choosing a smaller learning rate. Here is my new model structure:
def pnet_train2(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!! 

    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv) 
    if train_with_landmark: 
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor]) 
        model = Model(X, Pnet_output) 
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model

Then I changed the initial learning rate from 0.001 (the Keras default for Adam) to 0.0001, recompiling as sketched below. The same thing kept happening, only more slowly because of the lower learning rate.
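
For reference, the recompile with the lower learning rate looks like this; the clipnorm argument is a hypothetical extra I had not tried, which caps each gradient tensor's L2 norm and is a common guard against exploding gradients:

model = pnet_train2(True)
# lower initial learning rate; clipnorm = 1.0 is the untried extra safeguard
model.compile(optimizer = Adam(lr = 0.0001, clipnorm = 1.0), loss = custom_loss)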

  3. The last thing I could think of was the custom loss I wrote myself (a sketch of the kind of masked multi-task loss I mean follows right below). It is tailored to the MTCNN multi-task problem. Since my model's output has shape (17, ) ((batch_size, 17) with batching), I swapped my loss for 'mse' as a sanity check. It would not give great regression results, but it should at least train. The same thing happened: after the loss decreased for a few epochs, the loss and the model weights turned to 'nan'...
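
My actual custom_loss is not shown here, but this is a minimal sketch of the kind of masked multi-task loss I mean, under the label scheme from point 1 (illustrative only; the 1 : 0.5 : 0.5 task weights follow the MTCNN paper):

import tensorflow as tf

def custom_loss(y_true, y_pred):
    label = y_true[:, 0]

    # face classification: only positives (1) and negatives (0) contribute
    cls_mask = tf.cast(tf.logical_or(tf.equal(label, 1.0), tf.equal(label, 0.0)), tf.float32)
    eps = 1e-7  # clip the sigmoid output so log() stays finite
    p = tf.clip_by_value(y_pred[:, 0], eps, 1.0 - eps)
    cls_loss = -(label * tf.math.log(p) + (1.0 - label) * tf.math.log(1.0 - p))
    cls_loss = tf.reduce_sum(cls_loss * cls_mask) / tf.maximum(tf.reduce_sum(cls_mask), 1.0)

    # bbox regression: positives (1) and part faces (-1) contribute
    bbox_mask = tf.cast(tf.logical_or(tf.equal(label, 1.0), tf.equal(label, -1.0)), tf.float32)
    bbox_loss = tf.reduce_sum(tf.square(y_true[:, 1:5] - y_pred[:, 1:5]), axis = -1)
    bbox_loss = tf.reduce_sum(bbox_loss * bbox_mask) / tf.maximum(tf.reduce_sum(bbox_mask), 1.0)

    # landmark regression: only landmark samples (-2) contribute
    lmk_mask = tf.cast(tf.equal(label, -2.0), tf.float32)
    lmk_loss = tf.reduce_sum(tf.square(y_true[:, 5:] - y_pred[:, 5:]), axis = -1)
    lmk_loss = tf.reduce_sum(lmk_loss * lmk_mask) / tf.maximum(tf.reduce_sum(lmk_mask), 1.0)

    return cls_loss + 0.5 * bbox_loss + 0.5 * lmk_loss

Note the tf.clip_by_value around the sigmoid output and the tf.maximum guards on every denominator; an unclipped log(0) or an all-zero mask dividing by 0 are the two easiest ways for a loss like this to emit 'nan'.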

Thinking about it further: float32 can represent finite values only up to about ±3.4 × 10^38 (the ±2,147,483,647 range is actually int32's). Considering the previous weights were all around 10^-2, I suspected a denominator in the loss shrinking toward 0 and pushing a division past that limit. But with 'mse' there is no division at all, so I am still confused about the real cause.
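
A quick numpy check illustrates both failure modes, overflow to inf and 0/0 producing nan:

import numpy as np

print(np.finfo(np.float32).max)              # ~3.4028235e+38
print(np.float32(3e38) + np.float32(3e38))   # inf -- overflow
with np.errstate(divide = 'ignore', invalid = 'ignore'):
    print(np.float32(1.0) / np.float32(0.0)) # inf
    print(np.float32(0.0) / np.float32(0.0)) # nan

Once a single inf reaches the weights, the next update tends to produce expressions like inf * 0 or inf - inf, which evaluate to nan; that would match a normal-looking loss being followed by all-'nan' weights.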

I am honestly not sure whether this is a mistake on my side or a bug somewhere else, so any suggestions are appreciated. Thanks in advance.

Update 2020.04.10 01:57:
I tried to find the epoch where the loss goes wrong. The last epoch that still reports a normal loss already updates the model weights to 'nan'. So I think the problem happens during backpropagation (normal loss (0.0814) -> backprop -> model weights become 'nan' -> next loss becomes 'nan'). I used the following code to grab the model weights:

model.fit(x, y, batch_size = 1792, epochs = 1)
Train on 1792 samples
1792/1792 [==============================] - 0s 223us/sample - loss: 0.0814
Out[36]: <tensorflow.python.keras.callbacks.History at 0x205ff7b0188>

# snapshot the weights again -- after this last 'normal' epoch they
# already contain 'nan' entries
temp_weights_list1 = []
for layer in model.layers:
    temp_weights_list1.append(layer.get_weights())
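
To pin the failure down to an exact batch rather than an epoch, a per-batch monitor could be attached; a sketch (TerminateOnNaN is a built-in Keras callback, WeightNanMonitor is my own illustrative addition):

import numpy as np
import tensorflow as tf

class WeightNanMonitor(tf.keras.callbacks.Callback):
    """Stop training as soon as any weight tensor goes non-finite."""
    def on_train_batch_end(self, batch, logs = None):
        for w in self.model.get_weights():
            if not np.all(np.isfinite(w)):
                print('non-finite weights after batch', batch)
                self.model.stop_training = True
                return

model.fit(x, y, batch_size = 896, epochs = 10,
          callbacks = [tf.keras.callbacks.TerminateOnNaN(), WeightNanMonitor()])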
