Python backpropagation: gradients get smaller and smaller as the batch size increases

Date: 2020-06-25 00:36:28

Tags: python machine-learning neural-network gradient-descent backpropagation

I am training a neural network with the following dimensions: 784 (input layer), 45 (hidden layer), 16 (output layer),

using backpropagation (stochastic gradient descent) to classify digits and some mathematical symbols (0-9, +, -, *, /, [, ]).

While testing different mini-batch sizes I found the following problems:

1. With a mini-batch size of 20 data points, the backpropagation algorithm "seems" to work, but even after training for 50+ epochs the accuracy just fluctuates and gets worse (figure 1).

[Figure 1: Epoch vs. Accuracy (1 = 100%)]

2. With a mini-batch size of 2000 data points, the weight gradients are so small that they barely change the actual weights after the update (figure 2).

[Figure 2]

Below I post the relevant code of the class I use to train the neural network object. Not everything is visible, but the names are fairly self-explanatory.

Some relevant information:

  1. The training dataset is about 200k data points (tuples of a 28x28 numpy array and the corresponding symbol)
  2. The validation dataset is about 50k data points
  3. The algorithm uses MSE as the cost function
  4. I am using the backpropagation algorithm with the formulas below:

[Image: Backpropagation algorithm formulas]
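The formula image is not reproduced here; judging from the code below (MSE cost, and sigmoid activations, since the derivative is computed as a - a²), the per-example equations are presumably the standard ones:

\delta^L = (a^L - y) \odot a^L \odot (1 - a^L)

\delta^l = \left( (W^{l+1})^\top \delta^{l+1} \right) \odot a^l \odot (1 - a^l)

\frac{\partial C}{\partial W^l} = \delta^l \, (a^{l-1})^\top, \qquad \frac{\partial C}{\partial b^l} = \delta^l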

Important note: to make the backpropagation computation more efficient, I do it as batched tensor operations, where the gradients for the whole mini-batch are stored in tensors whose first axis corresponds to the index of the data point within the mini-batch and the remaining axes act as the usual matrices/vectors. More about this in a previous question: Python: numpy.dot / numpy.tensordot for multidimensional arrays

Example (mini-batch size: 20)

Activations at the last layer: value for one data point: (16x1); batched for backpropagation: (20x16x1)

Weight gradients at the last layer: value for one data point: (16x45); batched for backpropagation: (20x16x45)
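For illustration, here is a minimal NumPy sketch with random stand-in values (not the real network data) showing how these batched shapes combine into the weight-gradient tensor, using the same einsum as in the code below:

import numpy as np

batch = 20
delta_last = np.random.rand(batch, 16, 1)          # batched gradients w.r.t. the last-layer bias
hidden_activations = np.random.rand(batch, 45, 1)  # batched activations of the hidden layer

# batched outer product: one (16x45) weight gradient per data point in the mini-batch
weight_gradients = np.einsum('ijk,ilm->ijl', delta_last, hidden_activations)
print(weight_gradients.shape)  # (20, 16, 45)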

import numpy as np
import random as rd
import time
import matplotlib.pyplot as plt

class NeuralNetworkTrainer:
  def __init__(self, neuralNetwork,validator):
    self.network = neuralNetwork # Uses the neural network object which contains weights, bias' as a list of numpy arrays for each layer (the first element being 'None' to be consistent with indexes), as well as activations and layer sizes
    self.eta = 0
    self.dataSet = [] # Another class loads the dataset here (tuples of inputs, 2D-numpy arrays of the image and outputs, the corresponding symbol)
   
    self.initializeWeightsBias()
    self.validator = validator # validator object
    self.validationAccuracy = [] # list of accuracies per epoch
    
  def initializeWeightsBias(self): #gradients initialization
    self.gradientToBias = [None]*len(self.network.layers)
    self.gradientToWeights = [None]*len(self.network.layers)

  def train(self,epochs,miniBatchSize,eta): #train algorithm
    self.eta = eta
    for i in range(0,epochs):
      self.shuffleData()
      for j in range(0,len(self.dataSet)//miniBatchSize):
        self.batchBackPropagation(self.createMiniBatch(miniBatchSize,j))
        self.update()

      correctOutputs, dataSetLength = self.validator.validate()
      self.validationAccuracy.append(round(correctOutputs/dataSetLength,4))
    
    return self.network

# ***************************
# BACKPROPAGATION ALGORITHM

  def batchBackPropagation(self,inputOutputBatch):
    self.initializeWeightsBias()

    activations = [None]*len(self.network.activations)
    for i in range(0,len(activations)): #Initialize activations
      activations[i] = np.empty((len(inputOutputBatch),self.network.activations[i].shape[0],self.network.activations[i].shape[1]))
    
    output = np.empty((len(inputOutputBatch),self.network.activations[-1].shape[0],self.network.activations[-1].shape[1])) #correct formatting of output vector out of the symbol (vector with 0's and a 1 in the corresponding output)
    for i in range(0,len(inputOutputBatch)):
      inputVector, outputVector = self.vectorizeInputOuput(inputOutputBatch[i])
      self.network.loadInput(inputVector)
      self.network.activate() #feedforward of input through the network with current weights/bias
      output[i] = outputVector
      for l in range(1,len(activations)): #creation of activation tensor as explained before
        activations[l][i] = self.network.activations[l]
    
    self.gradientToBias[-1] =(activations[-1]-output)*(activations[-1]-np.square(activations[-1])) #calculation of gradientBias for last layer for all the minibatches as a 3D tensor calculation (see algorithm image)
    for i in range(2,len(self.network.layers)):
      self.gradientToBias[-i] = np.tensordot(self.gradientToBias[-i+1],self.network.weights[-i+1],axes= ((1),(0))).transpose(0,2,1)*(activations[-i]-np.square(activations[-i])) #calculation of the rest of the gradientToBias for the rest of the layers as a 3D tensor calculation the first index being the index of the dataset in that minibatch (according to algorithm image)
    for i in range(1,len(self.network.layers)): # analogous 3D tensor calculation of gradientToWeights for each dataset in the minibatch inside every layer of the 3D tensor
      self.gradientToWeights[i] = np.einsum('ijk,ilm->ijl',self.gradientToBias[i],activations[i-1])
    return self.network

# *****************************

  def update(self): #reduction of gradients of each dataset to one final gradient to each parameter by summing over axis=0)
    for i in range(1,len(self.network.layers)):
      self.network.weights[i] -= self.eta*np.sum(self.gradientToWeights[i],axis =0)
      self.network.bias[i] -= self.eta*np.sum(self.gradientToBias[i], axis = 0)
    return self.network

  def shuffleData(self): #self explanatory
    rd.shuffle(self.dataSet)
    return self.network 

  def createMiniBatch(self, miniBatchSize, index): #self explanatory
    return self.dataSet[index*miniBatchSize:(index+1)*miniBatchSize] 

  def mapOutputToVector(self,output): #self explanatory
      outputVector = np.zeros((len(self.network.outputMap),1))
      outputVector[self.network.outputMap.index(output)] = 1
      return outputVector

  def vectorizeInputOuput(self,inputOutputData): #selfexplanatory
    return inputOutputData.input.flatten().reshape((-1,1)), self.mapOutputToVector(inputOutputData.output)
  

Thanks a lot for your help!

1 Answer:

Answer 0 (score: 0)

First of all, a mini-batch size that is too large usually leads to lower accuracy.

The problem you are facing in the first plot is overfitting, so you need to reduce the number of epochs.

For the second plot: you have 200,000 samples and a batch size of 2000, so each epoch contains 200,000 / 2000 = 100 steps, which is considered a small number of gradient steps per epoch.

In general, you need to choose the right numbers for the number of epochs and the batch size to get the best results. Maybe aim for about 1000 steps per epoch, and do not train for too many epochs so that you do not overfit.
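As a rough illustration (the 200,000 samples come from the question; the ~1000 steps per epoch is only the suggestion above, not a verified optimum):

n_samples = 200_000

for batch_size in (20, 200, 2000):
    steps_per_epoch = n_samples // batch_size
    print(f"batch size {batch_size:>5} -> {steps_per_epoch} gradient updates per epoch")

# batch size  2000 -> only 100 updates per epoch
# batch size   200 -> about 1000 updates per epoch, in line with the suggestion above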