Vanishing gradients when using batch normalization with tf.layers.batch_normalization or tf.keras.layers.BatchNormalization
When I print out all the gradients as TensorBoard histograms, some of the weights and biases are never trained: the gradients on those variables are simply zero.
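For reference, the histograms are attached roughly like this (a minimal sketch, not my original logging code; the summary names and the helper are illustrative):

import tensorflow as tf

def log_gradient_histograms(grads, params, logdir='./logs'):
    # Attach a histogram summary to every gradient that exists; tf.gradients
    # returns None for variables that are not reachable from the loss.
    for grad, var in zip(grads, params):
        if grad is not None:
            tf.summary.histogram(var.name.replace(':', '_') + '/grad', grad)
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
    return merged, writer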
Here is the gradient structure returned by
self.i_grads = tf.gradients(ys=self.loss_imitation, xs=self.e_params, grad_ys=None)
self.i_grads: [<tf.Tensor 'imitation_train/gradients/actor_imitation/layer_1/MatMul_grad/MatMul_1:0' shape=(4, 185) dtype=float32>,
<tf.Tensor 'imitation_train/gradients/actor_imitation/layer_1/add_grad/Reshape_1:0' shape=(1, 185) dtype=float32>,
None,
None,
<tf.Tensor 'imitation_train/gradients/actor_imitation/layer_1/normalizer_actor_layer1/batchnorm/mul_grad/Mul_1:0' shape=(185,) dtype=float32>,
<tf.Tensor 'imitation_train/gradients/actor_imitation/layer_1/normalizer_actor_layer1/batchnorm/add_1_grad/Reshape_1:0' shape=(185,) dtype=float32>,
<tf.Tensor 'imitation_train/gradients/actor_imitation/action/MatMul_grad/MatMul_1:0' shape=(185, 2) dtype=float32>]
i_grads length: 7
And this is what is in tf.GraphKeys.GLOBAL_VARIABLES:
self.e_params: [<tf.Variable 'actor_imitation/layer_1/weight_actor_layer1:0' shape=(4, 185) dtype=float32_ref>,
<tf.Variable 'actor_imitation/layer_1/bias_actor_layer1:0' shape=(1, 185) dtype=float32_ref>,
<tf.Variable 'actor_imitation/layer_1/batch_normalization_v1/gamma:0' shape=(4,) dtype=float32>,
<tf.Variable 'actor_imitation/layer_1/batch_normalization_v1/beta:0' shape=(4,) dtype=float32>,
<tf.Variable 'actor_imitation/layer_1/normalizer_actor_layer1/gamma:0' shape=(185,) dtype=float32>,
<tf.Variable 'actor_imitation/layer_1/normalizer_actor_layer1/beta:0' shape=(185,) dtype=float32>,
<tf.Variable 'actor_imitation/action/weight_actor_action:0' shape=(185, 2) dtype=float32_ref>]
e_params length: 7
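As a debugging step, the two lists above can be zipped together to see exactly which variable each None entry belongs to (a small sketch, assuming i_grads and e_params are the two lists printed above):

for grad, var in zip(i_grads, e_params):
    # tf.gradients returns None when there is no path from the loss to the variable
    status = 'None (no path from loss)' if grad is None else grad.shape
    print('{:<70} grad: {}'.format(var.name, status))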
The code of the layer is as follows:
with tf.variable_scope(layer_scope):
    w_collection = tf.get_variable(weight_scope, [layer_input_dim, hidden_neuro_dim], initializer=initializer_w, trainable=trainable)
    b_collection = tf.get_variable(bias_scope, [1, hidden_neuro_dim], initializer=initializer_b, trainable=trainable)
    layer_output_0 = tf.matmul(layer_input, w_collection) + b_collection
    # layer_input_normalization = tf.keras.layers.BatchNormalization()(layer_input, training=True)
    # batch_normalization = tf.keras.layers.BatchNormalization(name=layer_normalizer_scope)
    # layer_output_normalization = batch_normalization(layer_output_0, training=True)
    # tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, batch_normalization.updates)
    layer_output_normalization = tf.layers.batch_normalization(layer_output_0, name=layer_normalizer_scope, training=True)
    layer_output = tf.nn.leaky_relu(layer_output_normalization)
    layer_output_dropout = tf.nn.dropout(layer_output, rate=dropout_rate)
    return layer_output_dropout
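For context, a call site consistent with the shapes in the listings above would look roughly like this (the wrapper name build_hidden_layer, the initializers and the dropout rate are illustrative assumptions; only the dimensions 4 and 185 and the scope names come from the printed output):

import tensorflow as tf

state_input = tf.placeholder(tf.float32, [None, 4], name='state')  # matches the (4, 185) weight above

with tf.variable_scope('actor_imitation'):
    hidden = build_hidden_layer(                 # hypothetical function wrapping the snippet above
        layer_input=state_input,
        layer_input_dim=4,
        hidden_neuro_dim=185,
        layer_scope='layer_1',
        weight_scope='weight_actor_layer1',
        bias_scope='bias_actor_layer1',
        layer_normalizer_scope='normalizer_actor_layer1',
        initializer_w=tf.truncated_normal_initializer(stddev=0.1),
        initializer_b=tf.constant_initializer(0.1),
        dropout_rate=0.1,
        trainable=True)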
The gradient computation is done like this:
self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='actor_imitation')
self.loss_imitation = tf.reduce_mean(tf.squared_difference(self.a, self.A_I))
with tf.variable_scope('imitation_train'):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        self.opt_i = tf.train.MomentumOptimizer(self.lr_i, self.momentum)
        self.i_grads = tf.gradients(ys=self.loss_imitation, xs=self.e_params, grad_ys=None)
        self.train_imitation = self.opt_i.apply_gradients(zip(self.i_grads, self.e_params))
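To separate gradients that are literally None (no backprop path in the graph) from gradients that merely evaluate to zero, the existing ones can be evaluated in a session, roughly like this (a sketch; sess and feed_dict stand in for the actual training session and inputs):

existing = [(var, grad) for var, grad in zip(e_params, i_grads) if grad is not None]
grad_values = sess.run([grad for _, grad in existing], feed_dict=feed_dict)
for (var, _), value in zip(existing, grad_values):
    # a truly vanishing gradient shows up here as a ~0 mean absolute value
    print('{:<70} mean |grad| = {:.6g}'.format(var.name, abs(value).mean()))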
Some of the gradients computed by self.i_grads = tf.gradients(ys=self.loss_imitation, xs=self.e_params, grad_ys=None) come back as all zeros (or None). From TensorBoard I can see that these entries correspond to the variables placed before the BN layer, which means the BN layer seems to prevent backpropagation from reaching the layers in front of it. Any ideas about what is happening here?
Thanks a lot!