Question

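I set out to learn backpropagation with gradient descent from the math, so that I can understand how things work without relying on libraries like Keras.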

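I took a sample program from the web and tried to make sure I understand each step. It uses the following:
1) A three-layer network. The input has 784 columns (features) with pixel values 0-255
2) 1 hidden layer with 250 neurons
3) 1 output layer with 1 neuron
4) Weights for both layers randomly generated between -1 and 1
5) The entire batch is trained in each epoch with a learning rate of 0.1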

import numpy as np
dataset = np.loadtxt(open("train.csv", "rb"), delimiter=",",skiprows=1,dtype=float)
X = dataset[:,1:]
y = dataset[:,0]
print(X.shape,y.shape)
X = X/255
y = y/10
y = np.reshape(y,(len(y),1)) ## Necessary to avoid mismatching dimensions

def sigmoid(x, derive=False):
   if derive:
     return x * (1 - x)
   return 1 / (1 + np.exp(-x))

# Define a learning rate
eta = 0.1
# Define the number of epochs for learning
epochs = 500000


w01 = np.random.uniform(low=-1, high=1, size=(784,250))
w12 = np.random.uniform(low=-1, high=1, size=(250,1))
# Start feeding forward and backpropagate *epochs* times.
for epoch in range(epochs):
   # Feed forward
   z_h = np.dot(X, w01)
   a_h = sigmoid(z_h)
   z_o = np.dot(a_h, w12)
   a_o = sigmoid(z_o)
   # Calculate the error
   a_o_error = ((1 / 2) * (np.power((a_o - y), 2)))
   #a_o_error = y-a_o
   # Backpropagation
   ## Output layer
   delta_a_o_error = a_o - y
   delta_z_o = sigmoid(a_o,derive=True)
   delta_w12 = a_h
   delta_output_layer = np.dot(delta_w12.T,(delta_a_o_error * delta_z_o))

   ## Hidden layer
   delta_a_h = np.dot(delta_a_o_error * delta_z_o, w12.T)
   delta_z_h = sigmoid(a_h,derive=True)
   delta_w01 = X
   delta_hidden_layer = np.dot(delta_w01.T, delta_a_h * delta_z_h)
   w01 = w01 - eta * delta_hidden_layer
   w12 = w12 - eta * delta_output_layer
   if epoch % 100 == 0:    
     print ("Loss at epoch "+str(epoch)+":"+str(np.mean(np.square(y - a_o))))


#Testing:
X_Test = X[129] 
Y_Test = y[129]  

z_h = np.dot(X_Test, w01)
a_h = sigmoid(z_h)
z_o = np.dot(a_h, w12)
a_o = sigmoid(z_o)

print("Expected Output:",Y_Test*10) 
print("Actual Output got:",a_o*10)

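Here are my problems:
1) I cannot feed in the entire MNIST dataset with its 42k samples, because I think neural networks work better with mini-batches, and I need a smaller dataset for a quick POC
2) I reduced the total input to 500 rows, and the NN correctly predicts the digit for any input row
3) However, when I increase the sample input to close to 3k rows, the loss does not change at all. I tried playing with the learning rate and the number of hidden-layer neurons, but nothing changed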

The data can be downloaded from: www.kaggle.com/c/digit-recognizer/data

I trimmed the train.csv file down to about 3k rows so that it can be used as input.

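Can someone help me understand this better, so that I can get my sample dataset to work? I have already spent a week on this and am still not giving up. The only thing I can think of trying is to create mini-batches within this program, but I am still working out how to do that, as I do not come from a programming background.

Thank you for reading my question and for your patience.

Regards, Chandan Jha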

Answer 1

A few suggestions that I think will improve your implementation:

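1) For the MNIST dataset, consider using a softmax function in the last layer instead of a sigmoid. A given input can belong to any one of several classes (0, 1, 2, ..., 9), and a single sigmoid output, which is a binary classifier, is not useful in this case. With softmax, the prediction is the digit with the highest probability among the 10 possibilities (0-9).
2) Preprocess the labels of your dataset into one-hot vector format: each label becomes a vector of size 10 in which only the index of the target digit is 1 and the rest are 0.
3) Unless you are stuck in a local minimum because of full-batch gradient descent, you should see the loss decrease over the iterations. Using mini-batches can help the training converge more robustly. You can take your existing code from above and put it into a structure like the following:

mini_batch_size = 64  # Use a batch size that fits in memory (typically a power of 2)
num_complete_batches = len(X) // mini_batch_size
for epoch in range(epochs):
   for curr_batch in range(num_complete_batches):
      # Samples are the rows of X and y, so slice along the first axis
      start = curr_batch * mini_batch_size
      current_x = X[start : start + mini_batch_size]
      current_y = y[start : start + mini_batch_size]
      # Forward and backward pass on current_x, current_y
   # If there are left-over examples that did not fit in complete batches,
   # feed those into the network now

4) Use a deeper network.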
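
To make the first three suggestions concrete, here is a minimal sketch that combines one-hot labels, a softmax output layer, and mini-batch updates. It is not the asker's program: the helper names (`one_hot`, `softmax`, `train`), the cross-entropy loss, the scaled-down weight initialization, and dividing the gradient by the batch size are all assumptions made for illustration, and small synthetic data stands in for train.csv so the snippet runs on its own.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn labels of shape (n,) into one-hot rows of shape (n, num_classes)."""
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels.astype(int)] = 1.0
    return encoded

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps np.exp from overflowing."""
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, labels, hidden=32, eta=0.5, epochs=30, batch_size=64, seed=0):
    """Hypothetical helper: one sigmoid hidden layer, softmax output, mini-batches."""
    rng = np.random.default_rng(seed)
    Y = one_hot(labels)
    n, d = X.shape
    # Assumption: small initial weights so the sigmoid units do not start saturated.
    w01 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, hidden))
    w12 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 10))
    losses = []
    for epoch in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            xb, yb = X[idx], Y[idx]
            # Forward pass
            a_h = sigmoid(xb @ w01)
            probs = softmax(a_h @ w12)
            # Backward pass: for softmax with cross-entropy, the gradient with
            # respect to the output pre-activations is (probs - yb).
            # Assumption: average over the batch so the step size does not
            # grow with the number of samples.
            delta_o = (probs - yb) / len(idx)
            delta_h = (delta_o @ w12.T) * a_h * (1.0 - a_h)
            w12 -= eta * (a_h.T @ delta_o)
            w01 -= eta * (xb.T @ delta_h)
        # Mean cross-entropy over the whole dataset after each epoch
        probs = softmax(sigmoid(X @ w01) @ w12)
        losses.append(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))
    return w01, w12, losses

# Synthetic stand-in for the trimmed train.csv: 500 samples, 20 features in [0, 1]
data_rng = np.random.default_rng(1)
labels = data_rng.integers(0, 10, size=500)
prototypes = data_rng.random((10, 20))
X = np.clip(prototypes[labels] + 0.05 * data_rng.normal(size=(500, 20)), 0.0, 1.0)

w01, w12, losses = train(X, labels)
predictions = np.argmax(softmax(sigmoid(X @ w01) @ w12), axis=1)
print("first/last epoch loss:", losses[0], losses[-1])
print("training accuracy:", np.mean(predictions == labels))
```

With the real data, `X` would be the pixel columns divided by 255 and `labels` the first column of train.csv, left as digits rather than divided by 10. The predicted digit is then the `argmax` over the 10 output probabilities.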

Loss does not change: backpropagation in Python 3.6 with the MNIST dataset
