How do I update the learning rate in a two-layer multi-layer perceptron?

Date: 2019-01-18 01:21:04

Tags: python numpy neural-network deep-learning xor

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

A simple

  • two-layer multi-layer perceptron (MLP)
  • with sigmoid activations between the layers, and
  • mean squared error (MSE) as the loss function / optimization criterion (sketched in symbols right after this list)
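
A sketch of what that means in symbols (my own notation; $W_1$ and $W_2$ are the weight matrices initialized in the code below):

$$a_1 = \sigma(X W_1), \qquad a_2 = \sigma(a_1 W_2), \qquad L = \mathrm{MSE}(a_2, Y), \qquad \text{with } \sigma(x) = \frac{1}{1 + e^{-x}}$$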

[code]:

def sigmoid(x): # Squashes each value into the open interval (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx): # For backpropagation.
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimension of the intermediate (hidden) layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

And given the stopping criterion as a fixed number of epochs (the number of iterations through X and Y), with a fixed learning rate of 0.3:

# Training hyperparameters.
num_epochs = 10000
learning_rate = 0.3

When I do the forward and backward propagation and update the weights in each epoch, how should the weights be updated?

I tried simply adding the product of the learning rate and the dot product of the backpropagated derivative with the layer outputs, but the model only ever updated the weights in one direction, causing all of them to decay towards zero.
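
In other words, the update I am aiming for is the usual gradient-descent step (a sketch in the notation of the code below, with $\eta$ the learning rate and $\delta_1, \delta_2$ the backpropagated errors of the two layers):

$$W_2 \leftarrow W_2 - \eta \, \mathtt{layer1}^{\top} \delta_2, \qquad W_1 \leftarrow W_1 - \eta \, \mathtt{layer0}^{\top} \delta_1$$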

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())

How should the weights be updated correctly?

Full code:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes each value into the open interval (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimension of the intermediate (hidden) layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Training hyperparameters.
num_epochs = 10000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())

# Visualize the losses
plt.plot(losses)
plt.show()

Am I missing anything in the backpropagation?

Maybe I'm missing the derivative from the cost back to the second layer?


EDITED

I realized that I was missing the partial derivative from the cost to the second layer. After adding it:

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

def mse_derivative(predicted, truth):
    return predicted - truth
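
In equation form, what I believe the backward pass should now compute is the following (my own sketch of the chain rule, writing $z_2 = a_1 W_2$ and $z_1 = X W_1$ for the pre-activations, and ignoring the constant factor that the 0.5/mean in the cost would introduce, which the learning rate can absorb):

$$\frac{\partial L}{\partial z_2} = (a_2 - Y) \odot a_2(1 - a_2), \qquad \frac{\partial L}{\partial W_2} = a_1^{\top} \frac{\partial L}{\partial z_2}$$

$$\frac{\partial L}{\partial z_1} = \Big(\frac{\partial L}{\partial z_2}\, W_2^{\top}\Big) \odot a_1(1 - a_1), \qquad \frac{\partial L}{\partial W_1} = X^{\top} \frac{\partial L}{\partial z_1}$$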

And the backpropagation loop with the updates across the epochs:

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = layer2_error *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

It seems to be training and learning XOR ...
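
A quick check I can run after the loop (my own sketch; it assumes the layer2 values from the final epoch are still in scope):

# Threshold the final forward-pass outputs and compare them to the XOR targets.
predictions = (layer2 > 0.5).astype(int)
print(predictions.T)  # hoping for [[0 1 1 0]]
print(Y.T)            # the actual XOR targets: [[0 1 1 0]]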

But now the question is whether layer2_error and layer2_delta are computed correctly, i.e. is the following part of the code correct?

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = np.dot(cost_delta, cost_error)
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

Is it correct to compute layer2_error as the dot product of cost_delta and cost_error? Or should layer2_error simply be equal to cost_delta, as in:

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = cost_delta
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

1 answer:

Answer 0 (score: 2):

Yes, it is correct to multiply the residuals (cost_error) with the delta values when you update the weights.

However, it doesn't matter whether you take the dot product or not, because cost_error is a scalar, so a plain multiplication suffices. What you definitely do have to multiply in is the gradient of the cost function, because that is where the backpropagation starts (i.e. it is the entry point for the backward pass).
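
A minimal sketch of that point (with hypothetical values, just to show that the dot product with a scalar and a plain multiplication agree):

import numpy as np

cost_delta = np.array([[0.2], [-0.1], [0.3], [-0.4]])  # hypothetical (predicted - truth) residuals
cost_error = 0.05                                      # hypothetical scalar cost value

# np.dot with a scalar operand reduces to elementwise scaling, so both forms agree.
via_dot = np.dot(cost_delta, cost_error)
via_mul = cost_delta * cost_error
print(np.allclose(via_dot, via_mul))  # True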

Also, the cost function can be simplified from:

def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

to:

def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))
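
As a quick usage sketch of the simplified cost (my own illustration, with hypothetical network outputs):

import numpy as np

def mse(predicted, truth):
    # np.mean already averages the squared errors over all samples.
    return 0.5 * np.mean(np.square(predicted - truth))

Y = np.array([[0, 1, 1, 0]]).T
predicted = np.array([[0.1, 0.9, 0.8, 0.2]]).T  # hypothetical outputs close to the XOR targets
print(mse(predicted, Y))  # 0.5 * mean([0.01, 0.01, 0.04, 0.04]) = 0.0125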