I am trying to stop training the batch-normalization variables after a given step during training. To do this, I wrote the following code:
import tensorflow as tf

early_training_stop = 1000
global_step = tf.train.get_or_create_global_step()
# Boolean tensor: True while global_step < early_training_stop, False afterwards.
train_phase = tf.less(global_step, early_training_stop)
out = tf.layers.conv2d(inputs, filters=16, kernel_size=4)
# decay, scale, epsilon, self.data_format and scope come from the
# surrounding model code.
out_bn = tf.contrib.layers.batch_norm(
    out,
    decay=decay,
    scale=scale,
    epsilon=epsilon,
    is_training=train_phase,
    fused=True,
    data_format=self.data_format,
    scope=scope)
loss_op = tf.losses.mean_squared_error(labels=targets, predictions=out_bn)
learning_rate = 1e-4
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# Run the batch-norm moving-average update ops together with each training step.
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss_op, global_step=global_step)
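
For completeness, the graph is driven by an ordinary session loop; a minimal sketch might look like the following (inputs and targets are assumed to be placeholders here, and batch_x/batch_y stand in for a real data feed):

# Minimal driver loop (sketch): inputs/targets assumed to be
# placeholders, batch_x/batch_y assumed to be NumPy batches.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2 * early_training_stop):
        _, loss_val, step_val = sess.run(
            [train_op, loss_op, global_step],
            feed_dict={inputs: batch_x, targets: batch_y})
        if step_val % 100 == 0:
            print('step %d, loss %f' % (step_val, loss_val))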
Training starts normally (the batch-norm variables are being trained). Unfortunately, as soon as global_step reaches early_training_stop, training fails with the following error: Model diverged with loss = NaN. How can I avoid this error? Perhaps by removing the batch_norm ops from the trained variables? (I could do it that way if that is the right approach.)
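
If freezing the batch-norm variables after the cutoff is the way to go, one option I can think of is to gate their gradients instead of removing them from the var_list, so the graph does not need two separate train ops. Below is a minimal sketch that would replace the last two lines of the snippet above; it assumes the batch-norm variables can be recognized by 'BatchNorm' in their names (the default scope of tf.contrib.layers.batch_norm; adjust if a custom scope is used):

# Sketch: multiply the batch-norm gradients by 0 once the cutoff step
# is reached, leaving all other gradients untouched. Assumes batch-norm
# variables contain 'BatchNorm' in their name.
gate = tf.cast(tf.less(global_step, early_training_stop), tf.float32)
grads_and_vars = optimizer.compute_gradients(loss_op)
gated = [(g * gate if g is not None and 'BatchNorm' in v.name else g, v)
         for g, v in grads_and_vars]
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.apply_gradients(gated, global_step=global_step)
# Caveat: Adam keeps applying small updates for a while after the gate
# closes, until its momentum accumulators decay toward zero.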