TensorFlow2 tf.keras: loss and model weights suddenly turn to 'nan' when training MTCNN PNet

Asked: 2020-04-09 07:27:22

Tags: tensorflow keras deep-learning gradient loss

I am trying to train the PNet of MTCNN with tfrecords. At first the loss decreased smoothly for a few epochs, then it turned into 'nan', and so did the model weights.

Below are my model structure and the training results:

from tensorflow.keras.layers import (Input, Conv2D, PReLU, MaxPooling2D,
                                     BatchNormalization, Reshape, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

def pnet_train1(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!! 

    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv) 
    if train_with_landmark: 
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor]) 
        model = Model(X, Pnet_output) 
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model

model = pnet_train1(True)
model.compile(optimizer = Adam(lr = 0.001), loss = custom_loss)
model.fit(ds, steps_per_epoch = 1636, epochs = 100, validation_data = ds, validation_steps = 1636)

[screenshot: pnet training records]
I know there could be several causes, so I have already run the following checks:

  1. Checked the dataset for bad data:
    My dataset is:
    X: images of shape (12, 12, 3);
    Y: labels of shape (17, ), the classification label, the 4 box regression coordinates and the 6 landmark points (12 coordinates) concatenated together.
    The label can be 1, -1, 0 or -2; only samples labeled 1 or 0 take part in the custom loss I wrote myself.
    The ROI and landmark coordinates all lie in [-1, 1].
    The image data is normalized to (x - 127.5) / 128 before being fed into the training pipeline.
    To verify whether the data was causing the 'nan' loss, I extracted one batch (e.g. 1792 samples) from the dataset as numpy arrays ((1792, 12, 12, 3), (1792, 17)). Training on this single batch alone still reproduces the problem. In the epochs before the loss turns to 'nan', the loss looks normal and all model weights lie within (-1, 1); they are all very small values:
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 74us/sample - loss: 0.1579
Epoch 2/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1574
Epoch 3/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1567
Epoch 4/10
1792/1792 [==============================] - 0s 65us/sample - loss: 0.1550
Epoch 5/10
1792/1792 [==============================] - 0s 61us/sample - loss: 0.1556
Epoch 6/10
1792/1792 [==============================] - 0s 70us/sample - loss: 0.1527
Epoch 7/10
1792/1792 [==============================] - 0s 71us/sample - loss: 0.1532
Epoch 8/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1509
Epoch 9/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1501
Epoch 10/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1495
Out[111]: <tensorflow.python.keras.callbacks.History at 0x1f767efa088>

# snapshot every layer's (still finite) weights for later inspection
temp_weights_list = []
for layer in model.layers:
    temp_weights_list.append(layer.get_weights())

model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 70us/sample - loss: nan
Epoch 2/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 3/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 4/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan

Then I rebuilt the model and loaded the healthy weights saved before the loss went wrong, and the 'nan' happened again.
Then I extracted a different batch of samples from the dataset, and the same thing happened.
So I now believe my dataset is fine (I generated the TFRecords with img_string = open(img_path, 'rb').read(), which would have raised an error for a corrupted image or an invalid image path).
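
For reference, a numpy sanity check along these lines (a sketch, not my exact code; x and y are the extracted arrays used in the fit calls above) can rule out non-finite values or out-of-range labels in the batch:

import numpy as np

# check the extracted batch against the ranges described in point 1
assert np.all(np.isfinite(x)) and np.all(np.isfinite(y))
assert set(np.unique(y[:, 0])) <= {1.0, 0.0, -1.0, -2.0}  # label column
assert np.all(np.abs(y[:, 1:]) <= 1.0)                    # roi / landmark coords
assert np.all(np.abs(x) <= 1.0)                           # after (x - 127.5) / 128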

  2. Since this usually points to exploding gradients, I tried:
    adding BatchNormalization layers,
    adding L2 regularization to the weights,
    using Xavier (Glorot) initialization,
    and choosing a smaller learning rate. Here is my new model structure:
def pnet_train2(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!! 

    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv) 
    if train_with_landmark: 
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor]) 
        model = Model(X, Pnet_output) 
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model

Then I changed the initial learning rate from 0.001 (the Keras default for Adam) to 0.0001, recompiling as sketched below. The same thing kept happening, only more slowly because of the lower learning rate.
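
For reference, the recompile with the lower learning rate looks like this; the clipnorm argument is a hypothetical extra I had not tried, which caps each gradient tensor's L2 norm and is a common guard against exploding gradients:

model = pnet_train2(True)
# lower initial learning rate; clipnorm = 1.0 is the untried extra safeguard
model.compile(optimizer = Adam(lr = 0.0001, clipnorm = 1.0), loss = custom_loss)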

  3. The last thing I could think of was the custom loss I wrote myself (a sketch of the kind of masked multi-task loss I mean follows right below). It is tailored to the MTCNN multi-task problem. Since my model's output has shape (17, ) ((batch_size, 17) with batching), I swapped my loss for 'mse' as a sanity check. It would not give great regression results, but it should at least train. The same thing happened: after the loss decreased for a few epochs, the loss and the model weights turned to 'nan'...
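
My actual custom_loss is not shown here, but this is a minimal sketch of the kind of masked multi-task loss I mean, under the label scheme from point 1 (illustrative only; the 1 : 0.5 : 0.5 task weights follow the MTCNN paper):

import tensorflow as tf

def custom_loss(y_true, y_pred):
    label = y_true[:, 0]

    # face classification: only positives (1) and negatives (0) contribute
    cls_mask = tf.cast(tf.logical_or(tf.equal(label, 1.0), tf.equal(label, 0.0)), tf.float32)
    eps = 1e-7  # clip the sigmoid output so log() stays finite
    p = tf.clip_by_value(y_pred[:, 0], eps, 1.0 - eps)
    cls_loss = -(label * tf.math.log(p) + (1.0 - label) * tf.math.log(1.0 - p))
    cls_loss = tf.reduce_sum(cls_loss * cls_mask) / tf.maximum(tf.reduce_sum(cls_mask), 1.0)

    # bbox regression: positives (1) and part faces (-1) contribute
    bbox_mask = tf.cast(tf.logical_or(tf.equal(label, 1.0), tf.equal(label, -1.0)), tf.float32)
    bbox_loss = tf.reduce_sum(tf.square(y_true[:, 1:5] - y_pred[:, 1:5]), axis = -1)
    bbox_loss = tf.reduce_sum(bbox_loss * bbox_mask) / tf.maximum(tf.reduce_sum(bbox_mask), 1.0)

    # landmark regression: only landmark samples (-2) contribute
    lmk_mask = tf.cast(tf.equal(label, -2.0), tf.float32)
    lmk_loss = tf.reduce_sum(tf.square(y_true[:, 5:] - y_pred[:, 5:]), axis = -1)
    lmk_loss = tf.reduce_sum(lmk_loss * lmk_mask) / tf.maximum(tf.reduce_sum(lmk_mask), 1.0)

    return cls_loss + 0.5 * bbox_loss + 0.5 * lmk_loss

Note the tf.clip_by_value around the sigmoid output and the tf.maximum guards on every denominator; an unclipped log(0) or an all-zero mask dividing by 0 are the two easiest ways for a loss like this to emit 'nan'.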

Thinking about it further: float32 can represent finite values only up to about ±3.4 × 10^38 (the ±2,147,483,647 range is actually int32's). Considering the previous weights were all around 10^-2, I suspected a denominator in the loss shrinking toward 0 and pushing a division past that limit. But with 'mse' there is no division at all, so I am still confused about the real cause.
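
A quick numpy check illustrates both failure modes, overflow to inf and 0/0 producing nan:

import numpy as np

print(np.finfo(np.float32).max)              # ~3.4028235e+38
print(np.float32(3e38) + np.float32(3e38))   # inf -- overflow
with np.errstate(divide = 'ignore', invalid = 'ignore'):
    print(np.float32(1.0) / np.float32(0.0)) # inf
    print(np.float32(0.0) / np.float32(0.0)) # nan

Once a single inf reaches the weights, the next update tends to produce expressions like inf * 0 or inf - inf, which evaluate to nan; that would match a normal-looking loss being followed by all-'nan' weights.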

I am honestly not sure whether this is a mistake on my side or a bug somewhere else, so any suggestions are appreciated. Thanks in advance.

Update 2020.04.10 01:57:
I tried to find the epoch where the loss goes wrong. The last epoch that still reports a normal loss already updates the model weights to 'nan'. So I think the problem happens during backpropagation (normal loss (0.0814) -> backprop -> model weights become 'nan' -> next loss becomes 'nan'). I used the following code to grab the model weights:

model.fit(x, y, batch_size = 1792, epochs = 1)
Train on 1792 samples
1792/1792 [==============================] - 0s 223us/sample - loss: 0.0814
Out[36]: <tensorflow.python.keras.callbacks.History at 0x205ff7b0188>

# snapshot the weights again -- after this last 'normal' epoch they
# already contain 'nan' entries
temp_weights_list1 = []
for layer in model.layers:
    temp_weights_list1.append(layer.get_weights())
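
To pin the failure down to an exact batch rather than an epoch, a per-batch monitor could be attached; a sketch (TerminateOnNaN is a built-in Keras callback, WeightNanMonitor is my own illustrative addition):

import numpy as np
import tensorflow as tf

class WeightNanMonitor(tf.keras.callbacks.Callback):
    """Stop training as soon as any weight tensor goes non-finite."""
    def on_train_batch_end(self, batch, logs = None):
        for w in self.model.get_weights():
            if not np.all(np.isfinite(w)):
                print('non-finite weights after batch', batch)
                self.model.stop_training = True
                return

model.fit(x, y, batch_size = 896, epochs = 10,
          callbacks = [tf.keras.callbacks.TerminateOnNaN(), WeightNanMonitor()])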
