I want to perform SGD on the following neural network:
Training set size = 200000
input layer size = 784
hidden layer size = 50
output layer size = 10
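(For reference, with these sizes the unrolled theta vector used by the cost function below has length 50*(784+1) + 10*(50+1) = 39250 + 510 = 39760.)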
I have an algorithm that performs batch gradient descent. To perform SGD, I assume the cost function should be modified to do its computation on a single training example (an array of size 784), and theta should then be updated after every example. Is this the correct way to implement SGD? If it is, I cannot get the cost function below (written for batch gradient descent) to work on a single training example. How do I run it on a single example? If it is not, what is the correct way to implement SGD on a neural network? The per-example update I have in mind is sketched right after this paragraph.
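This is only a rough sketch of the loop I have in mind, not working code: cost_single is a hypothetical single-example version of the cost function below (returning (J, grad)), and alpha is an untuned learning rate.

import numpy as np

def sgd(theta, X, y, lamb, alpha=0.01, epochs=1):
    for epoch in range(epochs):
        for i in np.random.permutation(len(X)):          # visit examples in random order
            J, grad = cost_single(theta, X[i], y[i], lamb)
            theta = theta - alpha * grad                  # update theta after every example
    return theta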
Python function that computes the cost and the gradient of theta for batch gradient descent:
def cost(theta, X, y, lamb):
    # get theta1 and theta2 from the unrolled theta vector
    th1 = (theta[0:(hiddenLayerSize*(inputLayerSize+1))].reshape((inputLayerSize+1, hiddenLayerSize))).T
    th2 = (theta[(hiddenLayerSize*(inputLayerSize+1)):].reshape((hiddenLayerSize+1, outputLayerSize))).T
    # matrices to store the gradients of theta1 & theta2
    th1_grad = np.zeros(th1.shape)
    th2_grad = np.zeros(th2.shape)
    I = np.identity(outputLayerSize, int)
    Y = np.zeros((realTrainSetSize, outputLayerSize))
    # one-hot encode each y[i] to the size of the output layer
    for i in range(0, realTrainSetSize):
        Y[i] = I[y[i]]
    # add bias unit to each training example and perform forward prop and backprop
    A1 = np.hstack([np.ones((realTrainSetSize, 1)), X])
    Z2 = A1 @ (th1.T)
    A2 = np.hstack([np.ones((len(Z2), 1)), sigmoid(Z2)])
    Z3 = A2 @ (th2.T)
    H = A3 = sigmoid(Z3)
    penalty = (lamb/(2*trainSetSize))*(sum(sum(np.delete(th1, 0, 1)**2)) + sum(sum(np.delete(th2, 0, 1)**2)))
    J = (1/2)*sum(sum(np.multiply(-Y, np.log(H)) - np.multiply((1-Y), np.log(1-H))))
    # backprop
    sigma3 = A3 - Y
    sigma2 = np.multiply(sigma3 @ th2, sigmoidGradient(np.hstack([np.ones((len(Z2), 1)), Z2])))
    sigma2 = np.delete(sigma2, 0, 1)
    delta_1 = sigma2.T @ A1  # getting dimension mismatch error here
    delta_2 = sigma3.T @ A2
    # gradients of theta1 and theta2 (regularized, bias column excluded)
    th1_grad = np.divide(delta_1, trainSetSize) + (lamb/trainSetSize)*(np.hstack([np.zeros((len(th1), 1)), np.delete(th1, 0, 1)]))
    th2_grad = np.divide(delta_2, trainSetSize) + (lamb/trainSetSize)*(np.hstack([np.zeros((len(th2), 1)), np.delete(th2, 0, 1)]))
    # unroll the gradients of theta1 and theta2
    theta_grad = np.concatenate(((th1_grad.T).ravel(), (th2_grad.T).ravel()))
    return (J, theta_grad)
When I call this function with a single training example, I get a dimension mismatch error while computing delta_1 and delta_2, but it works fine when called on the whole training batch.
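My current guess (not verified) is that the error comes from passing a 1-D array where the function expects a 2-D matrix, and that realTrainSetSize would also have to be 1 for such a call. A minimal shape check of that assumption:

import numpy as np

# Guess at the source of the mismatch: a single example arrives as a 1-D array
# of shape (784,), so np.hstack and the transposes inside cost() no longer
# produce the 2-D shapes the batch code relies on. Reshaping the example into
# a 1-row matrix keeps everything 2-D.
x_single = np.zeros(784)             # stands in for one training example, shape (784,)
x_as_row = x_single.reshape(1, -1)   # shape (1, 784): a "batch" containing one example
print(x_single.shape, x_as_row.shape)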