Deep learning: when the learning rate is too high

Asked: 2020-06-15 04:31:43

Tags: python tensorflow keras deep-learning

When I change the learning rate of SGD in Keras, I notice something really strange in my code:

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense
from keras.optimizers import SGD

def build_mlp():
    # Two 3x3 conv blocks with batch normalization, followed by a small dense classifier.
    model = Sequential()
    model.add(Conv2D(24, (3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Conv2D(24, (3, 3), padding='same', activation='relu'))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.summary()

    return model


model = build_mlp()
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.0005), metrics=['accuracy'])

While training on the MNIST dataset, I double the learning rate every 5 epochs (a sketch of one way to implement this schedule follows the log below). I expected the loss to diverge and oscillate as the learning rate grows. Instead, once the learning rate climbed from 0.4 to 0.8, the loss and accuracy stopped changing at all. Part of the training log is shown here:

Epoch, Learning rate, Accuracy, Loss
45,0.05119999870657921,0.67200000166893,5.286721663475037
46,0.05119999870657921,0.44419999949634076,8.957198877334594
47,0.05119999870657921,0.21029999982565642,12.728459935188294
48,0.05119999870657921,0.09939999926835298,14.515956773757935
49,0.05119999870657921,0.09949999924749137,14.514344959259033
50,0.10239999741315842,0.09939999926835298,14.515956773757935
51,0.10239999741315842,0.09979999924078584,14.509509530067444
52,0.10239999741315842,0.10109999923035502,14.488556008338929
53,0.10239999741315842,0.10089999923482537,14.49177963256836
54,0.10239999741315842,0.09979999924078584,14.509509530067444
55,0.20479999482631683,0.09899999927729368,14.522404017448425
56,0.20479999482631683,0.10129999965429307,14.4853324508667
57,0.20479999482631683,0.10119999963790179,14.486944255828858
58,0.20479999482631683,0.10129999965429307,14.4853324508667
59,0.20479999482631683,0.10119999963790179,14.486944255828858
60,0.40959998965263367,0.10129999965429307,14.4853324508667
61,0.40959998965263367,0.10119999963790179,14.486944255828858
62,0.40959998965263367,0.10129999965429307,14.4853324508667
63,0.40959998965263367,0.10139999965205788,14.48372064113617
64,0.40959998965263367,0.09189999906346202,14.636842398643493
65,0.8191999793052673,0.10099999930709601,14.490167903900147
66,0.8191999793052673,0.10099999930709601,14.490167903900147
67,0.8191999793052673,0.10099999930709601,14.490167903900147
68,0.8191999793052673,0.10099999930709601,14.490167903900147
69,0.8191999793052673,0.10099999930709601,14.490167903900147
70,1.6383999586105347,0.10099999930709601,14.490167903900147
71,1.6383999586105347,0.10099999930709601,14.490167903900147
72,1.6383999586105347,0.10099999930709601,14.490167903900147
73,1.6383999586105347,0.10099999930709601,14.490167903900147
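
(For reference, a minimal sketch of how a doubling-every-5-epochs schedule can be written with keras.callbacks.LearningRateScheduler; the base rate of 1e-4 is illustrative and only approximately matches the rates logged above, it is not necessarily the actual training code.)

from keras.callbacks import LearningRateScheduler

BASE_LR = 1e-4  # assumed starting rate, picked to roughly reproduce the logged values

def double_every_5_epochs(epoch):
    # Double the learning rate once every 5 epochs: lr = BASE_LR * 2 ** (epoch // 5)
    return BASE_LR * (2.0 ** (epoch // 5))

lr_schedule = LearningRateScheduler(double_every_5_epochs, verbose=1)
# Hypothetical usage, assuming x_train / y_train hold one-hot encoded MNIST data:
# model.fit(x_train, y_train, epochs=75, batch_size=128, callbacks=[lr_schedule])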

As you can see, after epoch 65 the loss is stuck at 14.490167903900147 and never changes again. Any idea what causes this phenomenon? Any suggestion is appreciated!

1 Answer:

Answer 0 (score: 1)

What is happening is that the high learning rate has pushed the layer weights out of a reasonable range. That in turn makes the softmax output values that are exactly 0 and 1, or very close to those numbers. The network has become "over-confident".

As a result, no matter what the input is, the network outputs 10-dimensional vectors like these:

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
...

On average it guesses the right class once out of every 10 attempts, so the accuracy stays stuck at 10%.
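
As a rough illustration (not the asker's model), here is how saturated logits make a softmax collapse to a one-hot vector in plain NumPy:

import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Moderate logits: a "soft" probability distribution over the 10 classes.
print(softmax(np.array([1.0, 0.5, 0.2, 0.0, -0.3, 0.1, 0.4, -0.1, 0.2, 0.3])))

# Huge logits, as produced by blown-up weights: effectively a one-hot vector,
# i.e. the network is fully confident regardless of the input.
print(softmax(np.array([5.0, 900.0, 10.0, -40.0, 0.0, 5.0, -10.0, 3.0, 7.0, 1.0])))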

To compute the network's loss, Keras computes the loss for each sample and then averages over the batch. In this case the loss is categorical cross-entropy, which equals the negative logarithm of the probability the network assigned to the target label.

If that probability is 1, the negative log is 0:

-np.log(1.0) = 0.0

But what if it is 0? The logarithm of 0 is undefined, so Keras applies a small amount of smoothing (clipping) to the value:

-np.log(0.0000001) = 16.11809565095832

So 9 out of every 10 samples contribute a loss of 16.11809565095832, and 1 out of 10 contributes a loss of 0. On average:

16.11809565095832 * 0.9 = 14.506286085862488
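
A quick NumPy check of that arithmetic, assuming predictions are clipped at Keras's default backend epsilon of 1e-7 before the log is taken (the exact clipping details can vary between versions):

import numpy as np

eps = 1e-7                       # default keras.backend.epsilon()
loss_wrong = -np.log(eps)        # ~16.118: the target class got probability ~0
loss_right = -np.log(1.0)        # 0.0: the target class got probability ~1

# With one-hot predictions over 10 balanced classes, about 9 of 10 guesses are wrong:
mean_loss = 0.9 * loss_wrong + 0.1 * loss_right
print(mean_loss)                 # ~14.506, close to the 14.49 plateau seen in the log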