Tensorflow: NaN training loss in every epoch (from the second batch of the first epoch on), and training accuracy is always constant

Asked: 2018-11-22 08:47:25

Tags: python tensorflow

The original code works correctly when it contains only loss1 and loss2, so I think my input data is fine.

However, after I added a third loss named 'sw_loss' (loss3), the training loss is always 'nan'. Its purpose is to minimize the L2 norms between the rows of 'features', where 'features' is the output of the last layer of the network.

In fact, the training loss becomes 'nan' at the second batch of the first epoch, while the loss of the first batch is about 2.2.

The main code is as follows:

features, _ = mnist_net(images)

centers = func.construct_center(features, FLAGS.num_classes)
loss1 = func.dce_loss(features, labels, centers, FLAGS.temp)
loss2 = func.pl_loss(features, labels, centers)
loss3 = func.sw_loss(features, similarity_weight_batch)  # loss3 is defined below
loss = loss1 + FLAGS.weight_pl * loss2 + FLAGS.weight_sw * loss3  # weighted sum of the three losses
eval_correct = func.evaluation(features, labels, centers)
train_op = func.training(loss, lr)

init = tf.global_variables_initializer()

# initialize the variables
sess = tf.Session()
sess.run(init)
#compute_centers(sess, add_op, count_op, average_op, images, labels, train_x, train_y)

# run the computation graph (train and test process)
epoch = 1
loss_before = np.inf
score_before = 0.0
stopping = 0
index = list(range(train_num))
np.random.shuffle(index)
batch_size = FLAGS.batch_size
batch_num = train_num // batch_size if train_num % batch_size == 0 else train_num // batch_size + 1
train_start = time.time()
while stopping<FLAGS.stop:
    time1 = time.time()
    loss_now = 0.0
    score_now = 0.0

    for i in range(batch_num):
        batch_x = train_x[index[i*batch_size:(i+1)*batch_size]]
        batch_y = train_y[index[i*batch_size:(i+1)*batch_size]]
        batch_index = np.asarray(index[i*batch_size:(i+1)*batch_size])
        weight_batch = np.zeros(shape=(batch_index.shape[0], batch_index.shape[0]))
        for j in range(batch_index.shape[0]):
            for k in range(batch_index.shape[0]):
                weight_batch[j, k] = similarity_weight[batch_index[j], batch_index[k]]
        result = sess.run([train_op, loss, eval_correct],
            feed_dict={images: batch_x, labels: batch_y, lr: FLAGS.learning_rate,
                       similarity_weight_batch: weight_batch})
        loss_now += result[1]
        score_now += result[2][1]
    score_now /= train_num
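As an aside, the nested j/k loop that fills weight_batch could be written with NumPy indexing instead. A sketch only, assuming similarity_weight is a 2-D NumPy array of shape (train_num, train_num):

# Equivalent to the j/k double loop above: pick out the rows and columns
# of similarity_weight that belong to the current batch in one step.
weight_batch = similarity_weight[np.ix_(batch_index, batch_index)]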

sw_loss is defined in the func file as follows:

def sw_loss(features, similarity_weight_batch):
    # 'similarity_weight_batch' holds the coefficients, each in (0, 1].
    # Pairwise squared differences via broadcasting: shape (batch, batch, dim).
    sqdiff = tf.squared_difference(features[:, tf.newaxis], features)
    # L2 norm between every pair of rows of 'features': shape (batch, batch).
    feature_matrix = tf.sqrt(tf.reduce_sum(sqdiff, axis=-1))
    # Weight each pairwise distance by its similarity coefficient.
    sw_loss_total = tf.multiply(similarity_weight_batch, feature_matrix)
    return tf.reduce_mean(sw_loss_total)
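To make the intended computation concrete, here is a minimal NumPy equivalent of feature_matrix (an illustration only, assuming features has shape (batch_size, feature_dim)):

import numpy as np

def pairwise_l2(features):
    # diff[j, k, :] = features[j] - features[k], via broadcasting
    diff = features[:, np.newaxis, :] - features[np.newaxis, :, :]
    # feature_matrix[j, k] = ||features[j] - features[k]||_2;
    # the diagonal (distance of each row to itself) is exactly 0
    return np.sqrt(np.sum(diff ** 2, axis=-1))

print(pairwise_l2(np.array([[0.0, 0.0], [3.0, 4.0]])))
# [[0. 5.]
#  [5. 0.]]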

The printed log is as follows; the training loss is 'nan' in every epoch:

epoch 1: training: loss --> nan, acc --> 15.514%
time for this epoch: 0.074 minutes
epoch 2:  training: loss --> nan, acc --> 15.514%
time for this epoch: 0.024 minutes
epoch 3: training: loss --> nan, acc --> 15.514%
time for this epoch: 0.073 minutes
epoch 4: training: loss --> nan, acc --> 15.514%
time for this epoch: 0.033 minutes
epoch 5: training: loss --> nan, acc --> 15.514%
time for this epoch: 0.021 minutes
...
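To localize which tensor first becomes NaN, one option (a sketch, assuming TensorFlow 1.x graph mode as in the code above) is tf.add_check_numerics_ops, which makes the run fail at the first op that produces a NaN or Inf:

# Build the check op once, after the graph is fully constructed.
check_op = tf.add_check_numerics_ops()  # asserts every float tensor is finite

result = sess.run([train_op, loss, eval_correct, check_op],
                  feed_dict={images: batch_x, labels: batch_y,
                             lr: FLAGS.learning_rate,
                             similarity_weight_batch: weight_batch})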
