Python backpropagation: gradients get smaller and smaller as the batch size increases

Date: 2020-06-25 00:36:28

Tags: python machine-learning neural-network gradient-descent backpropagation

I am training a neural network with the following dimensions: 784 (input layer), 45 (hidden layer), 16 (output layer),

using backpropagation (stochastic gradient descent) to classify digits and some mathematical symbols (0-9, +, -, *, /, [, ]).

While testing different mini-batch sizes I found the following problems:

1. With a mini-batch size of 20 data points, the backpropagation algorithm "seems" to work, but even after training for 50+ epochs the accuracy just fluctuates and gets worse (figure 1).

[Figure 1: Epoch vs. Accuracy (1 = 100%)]

2. With a mini-batch size of 2000 data points, the weight gradients are so small that they barely change the actual weights after the update (figure 2).

[Figure 2]

Below I post the relevant code of the class I use to train the neural network object. Not everything is visible, but the names are fairly self-explanatory.

Some relevant information:

  1. The training dataset is about 200k data points (tuples of a 28x28 numpy array and the corresponding symbol)
  2. The validation dataset is about 50k data points
  3. The algorithm uses MSE as the cost function
  4. I am using the backpropagation algorithm with the formulas below:

[Image: Backpropagation algorithm formulas]
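The formula image is not reproduced here; judging from the code below (MSE cost, and sigmoid activations, since the derivative is computed as a - a²), the per-example equations are presumably the standard ones:

\delta^L = (a^L - y) \odot a^L \odot (1 - a^L)

\delta^l = \left( (W^{l+1})^\top \delta^{l+1} \right) \odot a^l \odot (1 - a^l)

\frac{\partial C}{\partial W^l} = \delta^l \, (a^{l-1})^\top, \qquad \frac{\partial C}{\partial b^l} = \delta^l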

Important note: to make the backpropagation computation more efficient, I do it as batched tensor operations, where the gradients for the whole mini-batch are stored in tensors whose first axis corresponds to the index of the data point within the mini-batch and the remaining axes act as the usual matrices/vectors. More about this in a previous question: Python: numpy.dot / numpy.tensordot for multidimensional arrays

Example (mini-batch size: 20)

Activations at the last layer: value for one data point: (16x1); batched for backpropagation: (20x16x1)

Weight gradients at the last layer: value for one data point: (16x45); batched for backpropagation: (20x16x45)
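For illustration, here is a minimal NumPy sketch with random stand-in values (not the real network data) showing how these batched shapes combine into the weight-gradient tensor, using the same einsum as in the code below:

import numpy as np

batch = 20
delta_last = np.random.rand(batch, 16, 1)          # batched gradients w.r.t. the last-layer bias
hidden_activations = np.random.rand(batch, 45, 1)  # batched activations of the hidden layer

# batched outer product: one (16x45) weight gradient per data point in the mini-batch
weight_gradients = np.einsum('ijk,ilm->ijl', delta_last, hidden_activations)
print(weight_gradients.shape)  # (20, 16, 45)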

import numpy as np
import random as rd
import time
import matplotlib.pyplot as plt

class NeuralNetworkTrainer:
  def __init__(self, neuralNetwork,validator):
    self.network = neuralNetwork # Uses the neural network object which contains weights, bias' as a list of numpy arrays for each layer (the first element being 'None' to be consistent with indexes), as well as activations and layer sizes
    self.eta = 0
    self.dataSet = [] # Another class loads the dataset here (tuples of inputs, 2D-numpy arrays of the image and outputs, the corresponding symbol)
   
    self.initializeWeightsBias()
    self.validator = validator # validator object
    self.validationAccuracy = [] # list of accuracies per epoch
    
  def initializeWeightsBias(self): #gradients initialization
    self.gradientToBias = [None]*len(self.network.layers)
    self.gradientToWeights = [None]*len(self.network.layers)

  def train(self,epochs,miniBatchSize,eta): #train algorithm
    self.eta = eta
    for i in range(0,epochs):
      self.shuffleData()
      for j in range(0,len(self.dataSet)//miniBatchSize):
        self.batchBackPropagation(self.createMiniBatch(miniBatchSize,j))
        self.update()

      correctOutputs, dataSetLength = self.validator.validate()
      self.validationAccuracy.append(round(correctOutputs/dataSetLength,4))
    
    return self.network

# ***************************
# BACKPROPAGATION ALGORITHM

  def batchBackPropagation(self,inputOutputBatch):
    self.initializeWeightsBias()

    activations = [None]*len(self.network.activations)
    for i in range(0,len(activations)): #Initialize activations
      activations[i] = np.empty((len(inputOutputBatch),self.network.activations[i].shape[0],self.network.activations[i].shape[1]))
    
    output = np.empty((len(inputOutputBatch),self.network.activations[-1].shape[0],self.network.activations[-1].shape[1])) #correct formatting of output vector out of the symbol (vector with 0's and a 1 in the corresponding output)
    for i in range(0,len(inputOutputBatch)):
      inputVector, outputVector = self.vectorizeInputOuput(inputOutputBatch[i])
      self.network.loadInput(inputVector)
      self.network.activate() #feedforward of input through the network with current weights/bias
      output[i] = outputVector
      for l in range(1,len(activations)): #creation of activation tensor as explained before
        activations[l][i] = self.network.activations[l]
    
    self.gradientToBias[-1] =(activations[-1]-output)*(activations[-1]-np.square(activations[-1])) #calculation of gradientBias for last layer for all the minibatches as a 3D tensor calculation (see algorithm image)
    for i in range(2,len(self.network.layers)):
      self.gradientToBias[-i] = np.tensordot(self.gradientToBias[-i+1],self.network.weights[-i+1],axes= ((1),(0))).transpose(0,2,1)*(activations[-i]-np.square(activations[-i])) #calculation of the rest of the gradientToBias for the rest of the layers as a 3D tensor calculation the first index being the index of the dataset in that minibatch (according to algorithm image)
    for i in range(1,len(self.network.layers)): # analogous 3D tensor calculation of gradientToWeights for each dataset in the minibatch inside every layer of the 3D tensor
      self.gradientToWeights[i] = np.einsum('ijk,ilm->ijl',self.gradientToBias[i],activations[i-1])
    return self.network

# *****************************

  def update(self): #reduction of gradients of each dataset to one final gradient to each parameter by summing over axis=0)
    for i in range(1,len(self.network.layers)):
      self.network.weights[i] -= self.eta*np.sum(self.gradientToWeights[i],axis =0)
      self.network.bias[i] -= self.eta*np.sum(self.gradientToBias[i], axis = 0)
    return self.network

  def shuffleData(self): #self explanatory
    rd.shuffle(self.dataSet)
    return self.network 

  def createMiniBatch(self, miniBatchSize, index): #self explanatory
    return self.dataSet[index*miniBatchSize:(index+1)*miniBatchSize] 

  def mapOutputToVector(self,output): #self explanatory
      outputVector = np.zeros((len(self.network.outputMap),1))
      outputVector[self.network.outputMap.index(output)] = 1
      return outputVector

  def vectorizeInputOuput(self,inputOutputData): #selfexplanatory
    return inputOutputData.input.flatten().reshape((-1,1)), self.mapOutputToVector(inputOutputData.output)
  

Thanks a lot for your help!

1 Answer:

Answer 0 (score: 0)

First of all, a mini-batch size that is too large usually leads to lower accuracy.

The problem you are facing in the first plot is overfitting, so you need to reduce the number of epochs.

For the second plot: you have 200,000 samples and a batch size of 2000, so each epoch contains 200,000 / 2000 = 100 steps, which is considered a small number of gradient steps per epoch.

In general, you need to choose the right numbers for the number of epochs and the batch size to get the best results. Maybe aim for about 1000 steps per epoch, and do not train for too many epochs so that you do not overfit.
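As a rough illustration (the 200,000 samples come from the question; the ~1000 steps per epoch is only the suggestion above, not a verified optimum):

n_samples = 200_000

for batch_size in (20, 200, 2000):
    steps_per_epoch = n_samples // batch_size
    print(f"batch size {batch_size:>5} -> {steps_per_epoch} gradient updates per epoch")

# batch size  2000 -> only 100 updates per epoch
# batch size   200 -> about 1000 updates per epoch, in line with the suggestion above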