We know that the XOR problem can be solved with a multilayer perceptron: given all 4 Boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O.
For example, we can fully train the network to memorize the outputs of XOR:
import numpy as np

np.random.seed(0)

def sigmoid(x):  # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost function.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

num_epochs = 10000
learning_rate = 1.0

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # Update the weights.
    W2 += learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += learning_rate * np.dot(layer0.T, layer1_delta)
# On the training data
[int(prediction > 0.5) for prediction in layer2]

[OUT]:

[0, 1, 1, 0]

And if we feed in the same inputs again, we get the same outputs:

for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations into the trained W2.
    print(int(prediction > 0.5), y)

[OUT]:

0 [0]
1 [1]
1 [1]
0 [0]

But if we retrain the parameters (W1 and W2) while leaving out one of the data points, i.e.
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input[:-1]
Y = xor_output[:-1]

and run the rest of the same code, then no matter how I change the hyperparameters, it is unable to learn the XOR function and reproduce the I/O:

for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the unseen input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations into the trained W2.
    print(int(prediction > 0.5), y)

[OUT]:

0 [0]
1 [1]
1 [1]
1 [0]
Even if we shuffle the order of the inputs:

import random

# Shuffle the order of the inputs.
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))

we still can't train the XOR function fully:

for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the unseen input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations into the trained W2.
    print(x, int(prediction > 0.5), y)
So when the literature states that the multilayer perceptron (a.k.a. basic deep learning) solves XOR, does it mean that it can fully learn and memorize the weights given the complete set of inputs/outputs, but cannot generalize the XOR problem if one of the data points is missing?
Here is a link to a Kaggle notebook so that answerers can test the network themselves: https://www.kaggle.com/alvations/xor-with-mlp/
Answer (score: 1)
I think learning (generalizing) XOR and memorizing XOR are different things.
As you have seen, a two-layer perceptron can memorize XOR: there exists a combination of weights for which the loss is minimal and equal to 0 (the absolute minimum).
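As a quick illustration of that claim (this sketch is mine, not part of the original answer), here is a hand-picked weight combination for a hypothetical 2-2-1 sigmoid network with bias terms (unlike the bias-free 2-5-1 network in the question) that reproduces XOR:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden unit 1 acts like OR(x1, x2), hidden unit 2 like NAND(x1, x2);
# the output unit acts like AND of the two, which is exactly XOR.
W1 = np.array([[20., -20.],
               [20., -20.]])   # shape (2 inputs, 2 hidden units)
b1 = np.array([-10., 30.])
W2 = np.array([[20.],
               [20.]])         # shape (2 hidden units, 1 output)
b2 = np.array([-30.])

hidden = sigmoid(X @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
print(np.round(output, 3).ravel())  # ~ [0. 1. 1. 0.]

Because saturated sigmoids only approach 0 and 1 asymptotically, the loss here is not exactly 0, but it can be driven arbitrarily close to 0 by scaling these weights and biases up by a larger constant.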
If the weights are randomly initialized, you might also end up in a situation where you have actually learned XOR and not merely memorized it.
Note that multilayer perceptrons are non-convex functions, so there can be multiple minima (even multiple global minima). When the data is missing one input, there are multiple minima (all with equal loss), and among them are minima in which the missing point would be classified correctly. Hence, an MLP can learn XOR (although finding such a weight combination can be hard with the missing point).
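To make that concrete, below is a minimal sketch (again mine, under the assumption that we reuse the question's bias-free 2-5-1 architecture and plain gradient-descent loop) that retrains on only three of the four XOR points for several arbitrary random seeds and reports how each run classifies the held-out point [1, 1]:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    return sx * (1 - sx)

X_full = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_full = np.array([[0, 1, 1, 0]]).T
X, Y = X_full[:-1], Y_full[:-1]   # Drop the last data point, [1, 1] -> 0.

for seed in range(10):            # 10 seeds is an arbitrary choice.
    rng = np.random.RandomState(seed)
    W1 = rng.random_sample((2, 5))
    W2 = rng.random_sample((5, 1))
    for _ in range(10000):
        layer1 = sigmoid(X @ W1)
        layer2 = sigmoid(layer1 @ W2)
        layer2_delta = (Y - layer2) * sigmoid_derivative(layer2)
        layer1_delta = (layer2_delta @ W2.T) * sigmoid_derivative(layer1)
        W2 += layer1.T @ layer2_delta
        W1 += X.T @ layer1_delta
    # Classify the point that was never seen during training.
    held_out = sigmoid(sigmoid(X_full[-1] @ W1) @ W2)
    print(seed, int(held_out[0] > 0.5))   # Target is 0.

Which minimum gradient descent lands in depends on the initialization, so some runs may classify the held-out point correctly while others do not; that is the sense in which the MLP can learn XOR from three points, even though it is not guaranteed to.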
Neural networks are often argued to be universal function approximators that can fit even nonsensical labels. In that light, you may want to look at this work: https://arxiv.org/abs/1611.03530