Question

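I am very new to TensorFlow and have been working through the examples here. I want to rewrite the multilayer perceptron classification model as a regression model, but I ran into some strange behavior when modifying the loss function. It works fine with tf.reduce_mean, but if I use tf.reduce_sum instead, the output is nan. This seems odd because the two functions are so similar: the only difference is that the mean divides the summed result by the number of elements. So I can't see how this change could introduce nans.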

import numpy as np
import tensorflow as tf

# Parameters
learning_rate = 0.001

# Network Parameters
n_hidden_1 = 32 # 1st layer number of features
n_hidden_2 = 32 # 2nd layer number of features
n_input = 2 # number of inputs
n_output = 1 # number of outputs

# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:,0]**2 + np.sin(X[:,1])]

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with tanh activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.tanh(layer_1)
    # Hidden layer with tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_output]))
}

pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y))   # Why does this give nans?
mse = tf.reduce_mean(tf.square(pred - y))  # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

training_epochs = 10
display_step = 1

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Loop over all batches
    for i in range(100):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch:", '%04d' % (epoch+1), "mse=", \
            "{:.9f}".format(msev))

The problematic variable se is commented out. It should be used in place of mse.

With mse the output looks like this:

Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...

With se it ends up like this:

Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...

Answer 1

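By summing across the batch, your loss becomes 1000 times larger (from skimming the code, I think your training batch size is 1000), so your gradients and parameter updates are also 1000 times larger. Updates that large apparently lead to nans.

Generally, learning rates are expressed per example, so the loss used to compute the gradients for the updates should be per example too. If the loss is per batch, the learning rate needs to be divided by the batch size to get comparable training results.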
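To make the scaling argument concrete, here is a minimal NumPy sketch. It swaps the question's two-layer network for a toy linear model (an assumption made only to keep the gradient formula short) and reuses the question's synthetic data. The names `grad_sum`, `grad_mean`, and `ratio` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                  # batch size, as in the question
X = rng.random((N, 2))
T = X[:, :1] ** 2 + np.sin(X[:, 1:])      # same synthetic target as the question
w = rng.normal(size=(2, 1))               # toy linear model weights (assumption)

residual = X @ w - T                      # shape (N, 1)
grad_sum = 2.0 * X.T @ residual           # gradient of sum((Xw - T)^2) w.r.t. w
grad_mean = 2.0 * X.T @ residual / N      # gradient of mean((Xw - T)^2) w.r.t. w

# The two gradients point the same way; only the magnitude differs.
ratio = np.linalg.norm(grad_sum) / np.linalg.norm(grad_mean)
print(ratio)                              # approximately N
```

With the same learning rate, each step taken on the summed loss is therefore about N times longer than a step on the mean loss.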

Answer 2

If you use reduce_sum instead of reduce_mean, the gradient is much larger. You should therefore reduce the learning rate accordingly so that training proceeds properly.
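As a rough check of that advice, the sketch below runs plain gradient descent twice on a toy linear model (an assumption, not the question's network): once on the mean loss with the question's learning rate of 0.001, and once on the summed loss with that rate divided by the batch size. The helper `train` and its arguments are hypothetical names introduced for this illustration.

```python
import numpy as np

def train(reduction, lr, steps=50):
    """Gradient descent on squared error with a 'sum' or 'mean' reduction."""
    rng = np.random.default_rng(0)            # same data on every call
    X = rng.random((1000, 2))
    T = X[:, :1] ** 2 + np.sin(X[:, 1:])
    w = np.zeros((2, 1))
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - T)        # gradient of the summed squared error
        if reduction == "mean":
            grad = grad / len(X)              # mean loss divides by the batch size
        w = w - lr * grad
    return w

w_mean = train("mean", lr=0.001)
w_sum = train("sum", lr=0.001 / 1000)         # learning rate scaled down by batch size
print(np.allclose(w_mean, w_sum))
```

The two runs end at the same weights up to floating-point rounding, which is the sense in which the scaled-down learning rate gives comparable training.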

Answer 3

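In most of the literature, the loss is expressed as the mean of the losses over the batch. If the loss is computed with reduce_mean(), the learning rate should be regarded as a per-batch rate, which is larger.

It seems that in tensorflow.keras.losses you can still choose between mean and sum. For example, in tf.keras.losses.Huber the default is the mean, but you can set it to sum.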
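The effect of such a reduction option can be illustrated without TensorFlow. The function below is a hypothetical NumPy stand-in, not the Keras implementation: it computes a Huber loss per example and then either sums or averages, controlled by a `reduction` argument named after the option mentioned above.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0, reduction="mean"):
    """Hypothetical Huber loss with a selectable batch reduction."""
    err = np.abs(y_true - y_pred)
    quad = np.minimum(err, delta)                     # part of the error treated quadratically
    per_example = 0.5 * quad ** 2 + delta * (err - quad)
    if reduction == "sum":
        return per_example.sum()
    if reduction == "mean":
        return per_example.mean()
    raise ValueError("reduction must be 'sum' or 'mean'")

y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 4.0, 3.0])

loss_mean = huber(y_true, y_pred, reduction="mean")
loss_sum = huber(y_true, y_pred, reduction="sum")
print(loss_mean, loss_sum)                            # the sum is batch_size times the mean
```

Whichever reduction is chosen, the learning rate has to be picked with that scale in mind.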

Loss function works with reduce_mean but not reduce_sum
