I am training the same CNN architecture with CNTK and TensorFlow. However, when I use a larger learning rate (e.g. 0.05, which works fine with cntk.sgd), tf.train.GradientDescentOptimizer produces nan values. I tried to debug both and I can see that TensorFlow produces much larger outputs after the first iteration, which comes from large weights in the network. gradient_clipping_with_truncation is set to False for cntk.sgd(). I use tf.layers.conv2d for the TensorFlow code and cntk.layers.Convolution for the CNTK code.

Here is the TensorFlow pseudocode:
conv1_1 = tf.layers.conv2d(inputs=X_reshaped, filters=64, kernel_size=[3, 3], padding="same", activation=tf.nn.relu)
...
logits = tf.layers.dense(inputs=drop6, units=n_outputs, name="raw_outputs")
loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(logits=logits, onehot_labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)
training_op = optimizer.minimize(loss=loss)
_, batch_loss = sess.run([training_op, loss], feed_dict={X: X_batch, y: y_batch, training: True})
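
For reference, this is roughly how gradient clipping could be wired into the TensorFlow side to mimic the clipping options CNTK exposes. It is only a sketch, not my actual code; the clip_norm value of 5.0 is an arbitrary assumption:

# sketch: clip gradients by global norm before applying them,
# instead of calling optimizer.minimize() directly
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # 5.0 is an arbitrary threshold
training_op = optimizer.apply_gradients(zip(clipped_grads, variables))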
Here is the CNTK pseudocode:
def cost_func(prediction, target):
    # manual cross-entropy: -sum(target * log(prediction))
    train_loss = ct.negate(ct.reduce_sum(ct.element_times(target, ct.log(prediction)), axis=-1))
    return train_loss

# layers used inside the model (same architecture as the TensorFlow version)
ct.layers.Convolution((3,3), [64,128][i], pad=True)
ct.layers.Dense(num_classes, activation=None, name='output')

z = model.model(input_var)
train_loss = cost_func(pred, label_var)
pe = ct.classification_error(z, label_var)
learner = ct.sgd(z.parameters, lr=0.05, gradient_clipping_with_truncation=False)
trainer = ct.Trainer(z, (train_loss, pe), learner)

trainer.train_minibatch({input_var: images, label_var: labels})
training_loss += trainer.previous_minibatch_loss_average * current_batch_size
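
One difference worth noting between the two snippets: the TensorFlow loss is computed from raw logits via tf.losses.softmax_cross_entropy, while the CNTK cost applies ct.log to already-softmaxed predictions, which can itself produce nan/inf when a probability reaches 0. A sketch of the numerically safer CNTK equivalent, assuming z is the raw pre-softmax output of the network:

# sketch: let CNTK fuse softmax and cross-entropy, analogous to
# tf.losses.softmax_cross_entropy operating on logits
train_loss = ct.cross_entropy_with_softmax(z, label_var)
pe = ct.classification_error(z, label_var)
trainer = ct.Trainer(z, (train_loss, pe), learner)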
UPDATE: The outputs of both frameworks apparently become very large at the last layer (e.g. 2.6e+09), so there is clearly no truncation happening in either model. I am still investigating how CNTK manages to cope with this while TensorFlow cannot!
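
To narrow down where the values first explode on the TensorFlow side, this is roughly how I am instrumenting the graph during debugging; it is only a sketch, reusing the logits/loss/training_op names from the snippet above:

# sketch: check logits for nan/inf and watch their magnitude each step
check_op = tf.check_numerics(logits, message="logits contain nan/inf")
max_logit = tf.reduce_max(tf.abs(logits))

_, batch_loss, batch_max, _ = sess.run(
    [training_op, loss, max_logit, check_op],
    feed_dict={X: X_batch, y: y_batch, training: True})
print("max |logit| this step:", batch_max)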