Softmax activation with cross-entropy loss results in the outputs converging to exactly 0 and 1 for the two classes, respectively

Asked: 2018-04-24 14:55:20

Tags: python machine-learning neural-network classification softmax

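I have implemented a simple neural network with a single sigmoid hidden layer, with a choice of a sigmoid or softmax output layer paired with a squared-error or cross-entropy loss function, respectively. After a lot of research on the softmax activation function, the cross-entropy loss, and their derivatives (and after following this blog), I believe my implementation is correct.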
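
For reference, the standard result that most sources give for this derivative can be sketched as follows. The symbols are notation chosen here, not names from the code: $z$ is the vector of pre-softmax outputs, $p$ the softmax outputs, and $y$ a one-hot label.

```latex
\begin{align}
p_i &= \frac{e^{z_i}}{\sum_j e^{z_j}}, &
L &= -\sum_i y_i \log p_i, \\
\frac{\partial p_i}{\partial z_k} &= p_i\,(\delta_{ik} - p_k), &
\frac{\partial L}{\partial z_k} &= -\sum_i \frac{y_i}{p_i}\, p_i\,(\delta_{ik} - p_k)
  = p_k \sum_i y_i - y_k = p_k - y_k .
\end{align}
```

Note that $p_k - y_k$ is the gradient of the loss itself, so a gradient-descent update subtracts it from the parameters.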

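When trying to learn the simple XOR function, the NN with the sigmoid output reaches a very small loss very quickly when using a single binary output of 0 or 1. However, when the labels are changed to the one-hot encodings [1, 0] = 0 and [0, 1] = 1, the softmax implementation does not work. The loss keeps increasing while the network's two outputs converge to exactly [0, 1] for every input, even though the dataset's labels are perfectly balanced between [0, 1] and [1, 0].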

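My code is below. You can choose between the sigmoid and the softmax output layer by uncommenting the relevant two lines near the bottom of the code. I cannot figure out why the softmax implementation is not working.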

import numpy as np


class MLP:

    def __init__(self, numInputs, numHidden, numOutputs, activation):
        self.numInputs = numInputs
        self.numHidden = numHidden
        self.numOutputs = numOutputs

        self.activation = activation.upper()

        self.IH_weights = np.random.rand(numInputs, numHidden)      # Input -> Hidden
        self.HO_weights = np.random.rand(numHidden, numOutputs)     # Hidden -> Output

        self.IH_bias = np.zeros((1, numHidden))
        self.HO_bias = np.zeros((1, numOutputs))

        # Gradients corresponding to weight matrices computed during backprop
        self.IH_w_gradients = np.zeros_like(self.IH_weights)
        self.HO_w_gradients = np.zeros_like(self.HO_weights)

        # Gradients corresponding to biases computed during backprop
        self.IH_b_gradients = np.zeros_like(self.IH_bias)
        self.HO_b_gradients = np.zeros_like(self.HO_bias)

        # Input, hidden and output layer neuron values
        self.I = np.zeros(numInputs)    # Inputs
        self.L = np.zeros(numOutputs)   # Labels
        self.H = np.zeros(numHidden)    # Hidden
        self.O = np.zeros(numOutputs)   # Output

    # ##########################################################################
    # ACTIVATION FUNCTIONS
    # ##########################################################################

    def sigmoid(self, x, derivative=False):
        if derivative:
            return x * (1 - x)
        return 1 / (1 + np.exp(-x))

    def softmax(self, prediction, label=None, derivative=False):
        if derivative:
            return prediction - label
        return np.exp(prediction) / np.sum(np.exp(prediction))

    # ##########################################################################
    # LOSS FUNCTIONS
    # ##########################################################################

    def squaredError(self, prediction, label, derivative=False):
        if derivative:
            return (-2 * prediction) + (2 * label)
        return (prediction - label) ** 2

    def crossEntropy(self, prediction, label, derivative=False):
        if derivative:
            return [-(y / x) for x, y in zip(prediction, label)]    # NOT NEEDED ###############################
        return - np.sum([y * np.log(x) for x, y in zip(prediction, label)])

    # ##########################################################################

    def forward(self, inputs):
        self.I = np.array(inputs).reshape(1, self.numInputs)    # [numInputs, ] -> [1, numInputs]
        self.H = self.I.dot(self.IH_weights) + self.IH_bias
        self.H = self.sigmoid(self.H)
        self.O = self.H.dot(self.HO_weights) + self.HO_bias

        if self.activation == 'SIGMOID':
            self.O = self.sigmoid(self.O)
        elif self.activation == 'SOFTMAX':
            self.O = self.softmax(self.O) + 1e-10   # small offset so log() never sees exactly 0

        return self.O

    def backward(self, labels):
        self.L = np.array(labels).reshape(1, self.numOutputs)   # [numOutputs, ] -> [1, numOutputs]

        if self.activation == 'SIGMOID':
            self.O_error = self.squaredError(self.O, self.L)
            self.O_delta = self.squaredError(self.O, self.L, derivative=True) * self.sigmoid(self.O, derivative=True)
        elif self.activation == 'SOFTMAX':
            self.O_error = self.crossEntropy(self.O, self.L)
            self.O_delta = self.softmax(self.O, self.L, derivative=True)

        self.H_error = self.O_delta.dot(self.HO_weights.T)
        self.H_delta = self.H_error * self.sigmoid(self.H, derivative=True)

        self.IH_w_gradients += self.I.T.dot(self.H_delta)
        self.HO_w_gradients += self.H.T.dot(self.O_delta)

        self.IH_b_gradients += self.H_delta
        self.HO_b_gradients += self.O_delta

        return self.O_error

    def updateWeights(self, learningRate):
        self.IH_weights += learningRate * self.IH_w_gradients
        self.HO_weights += learningRate * self.HO_w_gradients
        self.IH_bias += learningRate * self.IH_b_gradients
        self.HO_bias += learningRate * self.HO_b_gradients

        self.IH_w_gradients = np.zeros_like(self.IH_weights)
        self.HO_w_gradients = np.zeros_like(self.HO_weights)
        self.IH_b_gradients = np.zeros_like(self.IH_bias)
        self.HO_b_gradients = np.zeros_like(self.HO_bias)


sigmoidData = [
    [[0, 0], 0],
    [[0, 1], 1],
    [[1, 0], 1],
    [[1, 1], 0]
]

softmaxData = [
    [[0, 0], [1, 0]],
    [[0, 1], [0, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [1, 0]]
]

sigmoidMLP = MLP(2, 10, 1, 'SIGMOID')
softmaxMLP = MLP(2, 10, 2, 'SOFTMAX')

# SIGMOID #######################
# data = sigmoidData
# mlp = sigmoidMLP
# ###############################

# SOFTMAX #######################
data = softmaxData
mlp = softmaxMLP
# ###############################

numEpochs = 5000
for epoch in range(numEpochs):
    losses = []
    for i in range(len(data)):
        print(mlp.forward(data[i][0]))      # Print outputs
        # mlp.forward(data[i][0])           # Don't print outputs
        loss = mlp.backward(data[i][1])
        losses.append(loss)
    mlp.updateWeights(0.001)
    # if epoch % 1000 == 0 or epoch == numEpochs - 1:   # Print loss every 1000 epochs
    print(np.mean(losses))                              # Print loss every epoch

1 Answer:

Answer 0 (score: 1)

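Contrary to all the information online, simply changing the derivative of the softmax cross-entropy from prediction - label to label - prediction solved the problem. Perhaps I have something else backwards somewhere, since every source I have come across has it as prediction - label.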
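
One possible explanation, going by the code in the question: `updateWeights` adds the accumulated gradients to the weights (`+=`), and `squaredError(..., derivative=True)` returns `2 * (label - prediction)`, which is the negative of the true derivative. Under that convention every delta is a descent direction, so the softmax delta has to be `label - prediction` as well. Keeping `prediction - label` and switching `updateWeights` to `-=` should be an equivalent fix, provided the sign of the squared-error derivative is flipped too. The sketch below is a minimal check of this reasoning, not the original code: `cross_entropy`, `numerical_gradient`, the logits `z`, and the step size `lr` are illustrative names and values chosen for the example.

```python
import numpy as np


def softmax(z):
    # Subtracting the max is a common stability trick; it does not change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def cross_entropy(z, y):
    # Loss as a function of the raw logits z and a one-hot label y.
    return -np.sum(y * np.log(softmax(z)))


def numerical_gradient(z, y, eps=1e-6):
    # Central finite differences of the loss with respect to each logit.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)
    return grad


z = np.array([0.2, -0.1])   # hypothetical logits
y = np.array([1.0, 0.0])    # one-hot label

analytic = softmax(z) - y   # prediction - label
numeric = numerical_gradient(z, y)
gradient_matches = np.allclose(analytic, numeric, atol=1e-6)

lr = 0.1
loss_before = cross_entropy(z, y)
loss_minus = cross_entropy(z - lr * analytic, y)   # subtracting the gradient: descent
loss_plus = cross_entropy(z + lr * analytic, y)    # adding the gradient: ascent

print(gradient_matches, loss_minus < loss_before, loss_plus > loss_before)
```

If the check behaves as expected, `prediction - label` is confirmed as the true gradient, subtracting it lowers the loss, and adding it (what `+=` does with `prediction - label`) raises the loss, which matches the steadily increasing loss described in the question.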