Question

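This is my first attempt at using a GRU RNN to train on a dataset containing 8 variables in a time series spanning about 20 years. Biomass is the value I want to predict from the other variables. I am starting with a single GRU layer. I am not using softmax on the output layer. MSE is my cost function.

It is a basic GRU with forward propagation and backward gradient updates. Here are the main functions I defined: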

   'x_t is the input training dataset with a dimension of 7572x8. So T = 7572, input_dim = 8, hidden_dim =128. y_train is my train label.'

   def forward_prop_step(self, x_t,y_train, s_t1_prev,V, U, W, b, c,learning_rate):
       T = x_t.shape[0]
       z_t1 = np.zeros((T,self.hidden_dim))
       r_t1 = np.zeros((T,self.hidden_dim))
       h_t1 = np.zeros((T,self.hidden_dim))
       s_t1 = np.zeros((T+1,self.hidden_dim))
       o_s = np.zeros((T,self.input_dim))
       for i in xrange(T):
           x_e = x_t[i].T
           z_t1[i] = sigmoid(U[0].dot(x_e) + W[0].dot(s_t1[i]) + b[0])#128x1
           r_t1[i] = sigmoid(U[1].dot(x_e) + W[1].dot(s_t1[i]) + b[1])#128x1
           h_t1[i] = np.tanh(U[2].dot(x_e) + W[2].dot(s_t1[i] * r_t1[i]) + b[2])#128x1
           s_t1[i+1] = (np.ones_like(z_t1[i]) - z_t1[i]) * h_t1[i] + z_t1[i] * s_t1[i]#128x1

           o_s[i] = np.dot(V,s_t1[i+1]) + c#8x1
       return [o_s,z_t1,r_t1,h_t1,s_t1]

   def bptt(self, x,y_train,o,z_t1,r_t1,h_t1,s_t1,V, U, W, b, c):
       bptt_truncate = 360
       T = x.shape[0]#length of time scale of input data (train)
       dLdU = np.zeros(U.shape)
       dLdV = np.zeros(V.shape)
       dLdW = np.zeros(W.shape)
       dLdb = np.zeros(b.shape)
       dLdc = np.zeros(c.shape)
       y_train_sp = np.repeat(y_train,self.input_dim)
       for t in np.arange(T)[::-1]:
           dLdy = 2 * (o[t] - y_train_sp[t])
           dydV = s_t1[t]
           dydc = 1.0
           dLdV += np.outer(dLdy,dydV)
           dLdc += dLdy*dydc            
           for i in np.arange(max(0, t-bptt_truncate), t+1)[::-30]:#every month in the past year           
               s_t1_pre = s_t1[i]          
               dydst1 = V #8x128                
               dst1dzt1 = -h_t1[i] + s_t1_pre #128x1
               dst1dht1 = np.ones_like(z_t1[i]) - z_t1[i] #128x1

               dzt1dU = np.outer(z_t1[i]*(1.0-z_t1[i]),x[i]) #128x8
               #print dzt1dU.shape
               dzt1dW = np.outer(z_t1[i]*(1.0-z_t1[i]),s_t1_pre)  #128x128
               dzt1db = z_t1[i]*(1.0-z_t1[i]) #128x1

               dht1dU = np.outer((1.0-h_t1[i] ** 2),x[i]) #128x8
               dht1dW = np.outer((1.0-h_t1[i] ** 2),s_t1_pre * r_t1[i])  #128x128
               dht1db = 1.0-h_t1[i] ** 2 #128x1

               dht1drt1 = (1.0-h_t1[i] ** 2)*(W[2].dot(s_t1_pre))#128x1

               drt1dU = np.outer((r_t1[i]*(1.0-r_t1[i])),x[i]) #128x8
               drt1dW = np.outer((r_t1[i]*(1.0-r_t1[i])),s_t1_pre) #128x128
               drt1db = (r_t1[i]*(1.0-r_t1[i]))#128x1
               dLdW[0] += np.outer(dydst1.T.dot(dLdy),dzt1dW.dot(dst1dzt1)) #128x128
               dLdU[0] += np.outer(dydst1.T.dot(dLdy),dst1dzt1.dot(dzt1dU)) #128x8
               dLdb[0] += (dydst1.T.dot(dLdy))*dst1dzt1*dzt1db#128x1

               dLdW[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dW)#128x128
               dLdU[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dU) #128x8
               dLdb[1] += (dydst1.T.dot(dLdy))*dst1dht1*dht1drt1*drt1db#128x1

               dLdW[2] += np.outer(dydst1.T.dot(dLdy),dht1dW.dot(dst1dht1))  #128x128
               dLdU[2] += np.outer(dydst1.T.dot(dLdy),dst1dht1.dot(dht1dU))#128x8
               dLdb[2] += (dydst1.T.dot(dLdy))*dst1dht1*dht1db#128x1

       return [ dLdV,dLdU, dLdW, dLdb, dLdc ]
   def predict( self, x): 
       pred = np.amax(x, axis = 1)
       pred_f = relu(pred)
       return pred_f

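The parameters V, U, W, b, c are updated using the gradients dLdV, dLdU, dLdW, dLdb, dLdc computed by bptt.

I tried different weight initializations (Xavier or just random) and different truncation lengths. However, they all lead to the same result. Probably the weight update was not correct? The network setup seems simple to me. I am really struggling to understand the predictions and translate them into actual biomass. The predict function I defined converts the output layer of the GRU network into a biomass value by taking the maximum. But the output layer gives similar values at almost every time step. I am not sure what the best way to do this is. Thanks in advance for any help or suggestions.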

Answer 1

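I doubt anyone on Stack Overflow is going to debug a custom GRU implementation for you. If you were using TensorFlow or another high-level library, or if it were a simple fully connected network, I might take a stab at it, but as it stands all I can do is offer some advice on how to go about debugging.

First, it sounds like you are running a brand-new implementation on your own dataset right away. Instead, test the network on simple synthetic datasets first. Can it learn the identity function? Can it learn a response that is simply a weighted average of the previous three time steps? And so on. Small, simple problems are easier to debug. Once you know your implementation can learn the things a GRU-based recurrent network should be able to learn, you can move on to your own data.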
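As a rough sketch of what such toy tasks might look like, the snippet below builds two synthetic datasets with the same 8-feature layout as in the question. The function names, the sequence length, and the weights (0.5, 0.3, 0.2) are made up for illustration and are not from the original post.

```python
import numpy as np

def make_identity_task(T=200, input_dim=8, seed=0):
    """Target at each step is just the first input feature at that step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(T, input_dim))
    y = x[:, 0].copy()
    return x, y

def make_weighted_average_task(T=200, input_dim=8,
                               weights=(0.5, 0.3, 0.2), seed=0):
    """Target is a weighted average of the first feature over the
    previous three time steps (hypothetical weights)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(T, input_dim))
    y = np.zeros(T)
    for t in range(3, T):
        # weights[0] applies to step t-1, weights[1] to t-2, weights[2] to t-3
        y[t] = sum(w * x[t - 1 - k, 0] for k, w in enumerate(weights))
    return x, y

x_id, y_id = make_identity_task()
x_avg, y_avg = make_weighted_average_task()
```

If the network cannot drive the training loss close to zero on the identity task, the problem is likely in the implementation rather than in the biomass data. The second task needs the hidden state to carry information across steps, so it exercises the recurrent part.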

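Second, this comment of yours was very insightful:

Probably the weight update was not correct?

While it is impossible to say for sure, this is a very common (perhaps the most common) source of bugs in hand-written backprop implementations. Andrew Ng recommends gradient checking for debugging such an implementation. Essentially, this involves approximating the gradient numerically. It is computationally inefficient, but it relies only on a correct implementation of forward propagation, which makes it very useful for debugging. For one thing, if the algorithm converges when using the numerically approximated gradient, you can be more confident that your forward prop is correct and focus on debugging backprop. (On the other hand, if it still does not succeed, the problem is likely in your forward prop function.) For another, once the algorithm works with the numerically approximated gradient, you can compare the output of your analytic gradient function against it and debug any discrepancies. That makes things much easier, because you now know the correct answer it should be returning.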
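Here is a minimal sketch of gradient checking with central differences. It uses a toy linear model with an MSE loss rather than the GRU from the question, and `numerical_gradient`, `relative_error`, and `mse_loss` are hypothetical helper names. The same idea should carry over by wrapping the forward pass and loss in a function of a single parameter array.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference approximation of d(loss)/d(params).

    params must be a float array; it is perturbed in place and restored."""
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        original = params[idx]
        params[idx] = original + eps
        loss_plus = loss_fn(params)
        params[idx] = original - eps
        loss_minus = loss_fn(params)
        params[idx] = original  # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

def relative_error(a, b):
    """Norm of the difference, scaled by the size of both gradients."""
    denom = max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)
    return np.linalg.norm(a - b) / denom

# Toy problem (assumed for illustration): linear model with MSE loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = rng.normal(size=20)
V = rng.normal(size=8)

def mse_loss(V):
    return np.mean((X.dot(V) - y) ** 2)

# Hand-derived gradient of the MSE loss with respect to V.
analytic = 2.0 * X.T.dot(X.dot(V) - y) / len(y)
numeric = numerical_gradient(mse_loss, V)
err = relative_error(analytic, numeric)
```

A very small `err` suggests the analytic gradient matches; a large one points at the backward pass. For a full GRU this is slow, since it needs two forward passes per parameter, so it is usually run on a tiny network (a few hidden units, a short sequence) and only while debugging.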
