I am implementing a simple CNN in Keras and trying to set layer-wise learning rates in Adam. I referred to this tutorial. The modified Adam is shown below:
class Adam_lr_mult(Optimizer):
    def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999,
                 epsilon=None, decay=0., amsgrad=False,
                 multipliers=None, debug_verbose=True, **kwargs):
        ...'''Omitted'''
        self.multipliers = multipliers
        self.layerwise_lr = {}  # record layer-wise lr
        self.debug_verbose = debug_verbose

    @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            lr *= (1. / (1. + self.decay * K.cast(self.iterations,
                                                  K.dtype(self.decay))))

        t = K.cast(self.iterations, K.floatx()) + 1
        lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                     (1. - K.pow(self.beta_1, t)))

        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        if self.amsgrad:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        else:
            vhats = [K.zeros(1) for _ in params]
        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
            # Learning rate multipliers
            if self.multipliers:
                multiplier = [mult for mult in self.multipliers if mult in p.name]
                if self.debug_verbose:
                    print('parameter: ', p.name)
            else:
                multiplier = None

            if multiplier:
                new_lr_t = lr_t * self.multipliers[multiplier[0]]
                self.layerwise_lr[multiplier[0]] = K.get_value(new_lr_t)
                if self.debug_verbose:
                    print('Setting {} to learning rate : {}'.format(multiplier[0], new_lr_t))
                    print('learning rate:', K.get_value(new_lr_t))
                    print('Dict:', self.layerwise_lr)
                    print('\n')
            else:
                new_lr_t = lr_t
                self.layerwise_lr[p.name.split('/')[0]] = K.get_value(new_lr_t)
                if self.debug_verbose:
                    print('No change in learning rate : {}'.format(p.name))
                    print('learning rate:', K.get_value(new_lr_t))
                    print('Dict:', self.layerwise_lr)
                    print('\n')
            ...'''Omitted'''
        print('***__Hello__***')
        return self.updates
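To make the matching logic easier to follow, this is how the multiplier lookup in get_updates behaves for a single parameter name (plain Python, purely illustrative; the names and values are examples taken from the setup further below):

# Keras variables are named '<layer_name>/<weight_name>:0', so a substring test
# against the multiplier keys picks out the layer a parameter belongs to.
multipliers = {'conv2d_1': 0.8, 'conv2d_2': 0.6}
param_name = 'conv2d_1/kernel:0'
matched = [mult for mult in multipliers if mult in param_name]
print(matched)                   # ['conv2d_1']
print(multipliers[matched[0]])   # 0.8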
In addition, I use the ReduceLROnPlateau and CSVLogger callbacks to record the learning rate. To log more information about the layer-wise learning rates, I also modified ReduceLROnPlateau:
class ReduceLROnPlateau_lr_mult(Callback):
    def __init__(self, monitor='val_loss', factor=0.1, patience=10,
                 verbose=0, mode='auto', min_delta=1e-4, cooldown=0, min_lr=0,
                 **kwargs):
        ...'''Omitted'''

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = K.get_value(self.model.optimizer.lr)
        logs.update(self.model.optimizer.layerwise_lr)  # only add this line
        ...'''Omitted'''
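The point of that extra line is that whatever the optimizer records in layerwise_lr should appear as additional columns in the file written by CSVLogger, since this callback runs before csv_logger in my callbacks list (see the fit call below). A quick way to confirm which columns were actually written, assuming the logger writes to a file such as 'training.csv' (the path is a placeholder):

import csv

with open('training.csv') as f:
    header = next(csv.reader(f))   # first row written by CSVLogger
print(header)
# expect something like ['epoch', 'acc', 'batch_normalization_1', ..., 'conv2d_1', ..., 'loss', 'lr', ...]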
To test the modified Adam and ReduceLROnPlateau, I use the MNIST dataset and build a simple CNN containing only 4 convolutional layers, 4 batch_normalization layers, and 1 dense layer. The code and results are shown below:
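The model is built roughly like this (the filter counts and kernel sizes here are placeholders, not my exact settings; only the layer layout, and hence the auto-generated layer names, matters for the multipliers below):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # conv2d_1
model.add(BatchNormalization())                                            # batch_normalization_1
model.add(Conv2D(32, (3, 3), activation='relu'))                           # conv2d_2
model.add(BatchNormalization())                                            # batch_normalization_2
model.add(Conv2D(64, (3, 3), activation='relu'))                           # conv2d_3
model.add(BatchNormalization())                                            # batch_normalization_3
model.add(Conv2D(64, (3, 3), activation='relu'))                           # conv2d_4
model.add(BatchNormalization())                                            # batch_normalization_4
model.add(Flatten())
model.add(Dense(10, activation='softmax'))                                 # dense_1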
# Learning multiplier
lr_multipliers = {}
lr_multipliers['conv2d_1'] = 0.8
lr_multipliers['batch_normalization_1'] = 0.8
lr_multipliers['conv2d_2'] = 0.6
lr_multipliers['conv2d_3'] = 0.4
lr_multipliers['conv2d_4'] = 0.2
# Adam with layer-wise lr
adam_lr_mult = Adam_lr_mult(multipliers=lr_multipliers)
# ReduceLROnPlateau with layer-wise lr dict
lr_reducer_mult = ReduceLROnPlateau_lr_mult(monitor='val_loss', factor=np.sqrt(0.1),
                                            cooldown=0, patience=5, min_lr=0.00001,
                                            mode='auto', verbose=1)

model.compile(loss='categorical_crossentropy', optimizer=adam_lr_mult, metrics=['accuracy'])
model.fit(..., verbose=2,
          callbacks=[lr_reducer_mult, early_stopper, csv_logger, ...])
The results show:
...'''Omitted'''
parameter: batch_normalization_4/beta:0
No change in learning rate : batch_normalization_4/beta:0
learning rate: 0.00031623512
Dict: {'conv2d_1': 0.00025298807, 'batch_normalization_1': 0.00025298807,...}
parameter: dense_1/kernel:0
No change in learning rate : dense_1/kernel:0
learning rate: 0.00031623512
Dict: {'conv2d_1': 0.00025298807, 'batch_normalization_1': 0.00025298807,...}
parameter: dense_1/bias:0
No change in learning rate : dense_1/bias:0
learning rate: 0.00031623512
Dict: {'conv2d_1': 0.00025298807, 'batch_normalization_1': 0.00025298807,...}
***__Hello__***
Train on 48000 samples, validate on 12000 samples
Epoch 1/100
- 4s - loss: 0.4517 - acc: 0.9098 - val_loss: 0.2027 - val_acc: 0.9572
Epoch 2/100
- 2s - loss: 0.1029 - acc: 0.9827 - val_loss: 0.1374 - val_acc: 0.9718
Epoch 3/100
- 2s - loss: 0.0739 - acc: 0.9905 - val_loss: 0.0929 - val_acc: 0.9833
Epoch 4/100
- 2s - loss: 0.0604 - acc: 0.9939 - val_loss: 0.0815 - val_acc: 0.9865
Epoch 5/100
- 2s - loss: 0.0513 - acc: 0.9959 - val_loss: 0.0785 - val_acc: 0.9864
Epoch 6/100
- 2s - loss: 0.0448 - acc: 0.9979 - val_loss: 0.1081 - val_acc: 0.9759
Epoch 7/100
- 2s - loss: 0.0405 - acc: 0.9984 - val_loss: 0.0752 - val_acc: 0.9864
Epoch 8/100
- 2s - loss: 0.0368 - acc: 0.9990 - val_loss: 0.1382 - val_acc: 0.9666
Epoch 9/100
- 2s - loss: 0.0337 - acc: 0.9996 - val_loss: 0.0659 - val_acc: 0.9890
Epoch 10/100
- 2s - loss: 0.0314 - acc: 0.9998 - val_loss: 0.0746 - val_acc: 0.9860
...
Epoch 25/100
- 2s - loss: 0.0177 - acc: 0.9991 - val_loss: 0.1212 - val_acc: 0.9731
Epoch 00025: ReduceLROnPlateau reducing learning rate to 0.00031622778103685084.
Epoch 26/100
- 2s - loss: 0.0155 - acc: 0.9998 - val_loss: 0.0446 - val_acc: 0.9915
Epoch 27/100
- 2s - loss: 0.0146 - acc: 1.0000 - val_loss: 0.0422 - val_acc: 0.9926
My question:
I am not sure where I went wrong: logs['lr'] changes in the CSV file, but the dictionary "layerwise_lr" does not. To find the problem, I added the line print('***__Hello__***') in Adam, and it is printed only once. What confuses me is that the information about setting the layer-wise learning rates appears only once, before the first epoch, and never again. Can anyone give me some advice? Thanks a lot!