TensorFlow model diverges when trying to stop batch_norm training during calibration

Time: 2018-07-27 14:43:59

Tags: python tensorflow

I am trying to stop the training of the batch-normalization variables at a given step during training. To do this, I wrote the following code:

import tensorflow as tf

# inputs, targets, decay, scale, epsilon, self.data_format and scope are
# defined elsewhere in the model.
early_training_stop = 1000
global_step = tf.train.get_or_create_global_step()
# Boolean tensor: True while global_step < early_training_stop.
train_phase = tf.less(global_step, early_training_stop)

out = tf.layers.conv2d(inputs, filters=16, kernel_size=4)
out_bn = tf.contrib.layers.batch_norm(
        out,
        decay=decay,
        scale=scale,
        epsilon=epsilon,
        is_training=train_phase,
        fused=True,
        data_format=self.data_format,
        scope=scope)

loss_op = tf.losses.mean_squared_error(labels=targets, predictions=out_bn)
learning_rate = 1e-4
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# Run the batch-norm moving-average updates before each training step.
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss_op, global_step=global_step)

When I train the network, it starts off fine (the batch-norm variables are being trained). Unfortunately, when global_step reaches early_training_stop, I get the following error: Model diverged with loss = NaN. How can I avoid this error? Perhaps by removing the batch_norm ops from the trainable variables (assuming that can even be done)?
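For reference, a minimal sketch of the approach suggested above: exclude the batch-norm variables from the var_list passed to minimize(), so they receive no gradient updates, while still running UPDATE_OPS so the moving averages are maintained. The 'BatchNorm' name filter is an assumption based on the default scope created by tf.contrib.layers.batch_norm and would need adjusting for a custom scope; note also that this excludes the variables for the whole run rather than only after early_training_stop.

# Sketch only: filter the batch-norm scale/offset variables out of the set
# the optimizer updates. The 'BatchNorm' substring assumes the default
# variable scope used by tf.contrib.layers.batch_norm.
non_bn_vars = [v for v in tf.trainable_variables()
               if 'BatchNorm' not in v.name]

with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    # Only non-batch-norm variables receive gradient updates; the moving
    # mean/variance are still refreshed via UPDATE_OPS.
    train_op = optimizer.minimize(loss_op, var_list=non_bn_vars,
                                  global_step=global_step)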

0 Answers