Learning rate change with the momentum optimizer

Date: 2016-11-05 21:28:27

Tags: tensorflow deep-learning

When running an existing Tensorflow implementation, I found that the learning rate stays the same across epochs. The original implementation uses tf.train.MomentumOptimizer and sets a decay rate.

My understanding of the momentum optimizer is that the learning rate should decrease with the epochs. Why does the learning rate stay constant throughout my training? Could the learning rate also depend on performance, e.g. if performance is not changing quickly, the learning rate stays the same? I think I am not very clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays constant across epochs even though I expect it to keep decreasing according to the given decay rate.

The optimizer is defined as follows:

        learning_rate = 0.2
        decay_rate = 0.95
        self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                              global_step=global_step,
                                                              decay_steps=training_iters,
                                                              decay_rate=decay_rate,
                                                              staircase=True)

        optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                                global_step=global_step)

1 Answer:

Answer 0 (score: 0)

Without seeing your code it is a bit hard to tell whether my answer will be helpful to you.

However, here are some insights on how the momentum optimizer works and why the learning rate should decay.

  1. First, the most basic update of the vanilla GradientDescentMinimizer:

    W^(n+1) = W^(n) - alpha * (gradient of cost w.r.t. W)(W^(n))

    You are just following the opposite of the gradient.

  2. GradientDescentMinimizer with learning rate decay:

    W^(n+1) = W^(n) - alpha(n) * (gradient of cost w.r.t. W)(W^(n))

    The only thing that changed is the learning rate alpha, which now depends on the step. In Tensorflow the most commonly used form is exponential decay, where after N steps the learning rate is divided by some constant, e.g. 10.

    This change often happens later in training, so you may need to let a few epochs pass before seeing the learning rate decay.

  3. MomentumOptimizer: you have to keep an additional variable, namely the update you did previously, i.e. you have to store at each time step:

    update^(n) = W^(n) - W^(n-1)

    The momentum-corrected update is then:

    update^(n+1) = m * update^(n) - alpha * (gradient of cost w.r.t. W)(W^(n))

    So what you are doing is simple gradient descent, but corrected by remembering the immediate past (there are smarter and more complicated ways of doing it, like Nesterov's momentum).

  4. MomentumOptimizer with learning rate decay:

    update^(n) = W^(n) - W^(n-1)

    update^(n+1) = m * update^(n) - alpha(n) * (gradient of cost w.r.t. W)(W^(n))

    alpha now depends on n too.

    So at some point the updates will start slowing down, as with plain learning rate decay, but the decrease is affected by the momentum. (A short numerical sketch of these update rules follows right after this list.)
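
To make this concrete, here is a minimal pure-Python sketch of the decayed update rules (items 2 and 4 above) on a toy 1-D cost W^2; alpha0, m and the decay schedule are made-up illustrative values, not the TensorFlow implementation:

        #Toy 1-D problem: cost(W) = W^2, so the gradient of the cost w.r.t. W is 2*W
        def grad(W):
            return 2.0 * W

        alpha0, m = 0.1, 0.9               #initial learning rate and momentum (illustrative values)
        decay_rate, decay_steps = 0.5, 20  #halve the learning rate every 20 steps (staircase)

        W_sgd = W_mom = 5.0                #same starting point for both rules
        update = 0.0                       #stored previous update for the momentum rule

        for n in range(60):
            #staircase exponential decay: alpha(n) = alpha0 * decay_rate ** floor(n / decay_steps)
            alpha_n = alpha0 * decay_rate ** (n // decay_steps)

            #item 2: plain gradient descent with learning rate decay
            W_sgd = W_sgd - alpha_n * grad(W_sgd)

            #item 4: momentum with learning rate decay
            update = m * update - alpha_n * grad(W_mom)
            W_mom = W_mom + update

            if n % 20 == 0:
                print(n, alpha_n, W_sgd, W_mom)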

For a complete review of these methods you have the excellent website, which explains them far better than I do, and Alec Radford's famous visualization, which is worth a thousand words.

The learning rate should not depend on performance unless that is specified in the decay! It would help to see the code in question!

EDIT1: Here is a working example that I think answers the two questions you asked:

        import tensorflow as tf
        import numpy as np
        import matplotlib.pyplot as plt
        
        #Pure SGD
        BATCH_SIZE=1
        #Batch Gradient Descent
        #BATCH_SIZE=1000
        starter_learning_rate=0.001
        
        xdata=np.linspace(0.,2*np.pi,1000)[:,np.newaxis]
        ydata=np.sin(xdata)+np.random.normal(0.0,0.05,size=1000)[:,np.newaxis]
        plt.scatter(xdata,ydata)
        
        x=tf.placeholder(tf.float32,[None,1])
        y=tf.placeholder(tf.float32, [None,1])
        #We define global_step as a variable initialized at 0
        global_step=tf.Variable(0,trainable=False)
        
        w1=tf.Variable(0.05*tf.random_normal((1,100)),tf.float32)
        w2=tf.Variable(0.05*tf.random_normal((100,1)),tf.float32)
        
        b1=tf.Variable(np.zeros([100]).astype("float32"),tf.float32)
        b2=tf.Variable(np.zeros([1]).astype("float32"),tf.float32)
        
        h1=tf.nn.relu(tf.matmul(x,w1)+b1)
        y_model=tf.matmul(h1,w2)+b2
        
        
        L=tf.reduce_mean(tf.square(y_model-y))
        
        #We want to decrease the learning rate after having seen all the data 5 times
        NUM_EPOCHS_PER_DECAY=5
        LEARNING_RATE_DECAY_FACTOR=0.1
        #Since the mechanism of the decay depends on the number of iterations and not epochs we have to connect the number of epochs to the number of iterations
        
        #So if we have batch_size=1 we have to iterate exactly 1000 times to do one epoch, so 5*1000=5000 iterations before decaying; if the batch_size were 1000, 1 iteration = 1 epoch and thus we would decrease it after 5 iterations
        num_batches_per_epoch=int(xdata.shape[0]/float(BATCH_SIZE))
        decay_steps=int(num_batches_per_epoch*NUM_EPOCHS_PER_DECAY)
        decayed_learning_rate=tf.train.exponential_decay(starter_learning_rate, 
                                                             global_step, 
                                                             decay_steps, 
                                                             LEARNING_RATE_DECAY_FACTOR,
                                                             staircase=True)
        
        #So now we have an object that depends on global_step and that will be divided by 10 every decay_steps iterations i.e. when global_step=N*decay_steps with N a non zero integer         
        
        #We now create a train_step to which we pass the decayed learning rate. Each time train_step is run, global_step will be incremented by 1 (we are going to check that this is the case). BE CAREFUL: WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT
        train_step=tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L,global_step=global_step)
        sess=tf.Session()
        sess.run(tf.initialize_all_variables())
        GLOBAL_s=[]
        lr_val=[]
        COSTS=[]
        for i in range(16000):
            #We will do 16000 iterations, so as there is a decay every 5000 iterations we will see 3 decays (at 5000, 10000 and 15000)
            start_data=(i*BATCH_SIZE)%1000
            COSTS.append([sess.run(L, feed_dict={x:xdata,y:ydata})])
            GLOBAL_s.append([sess.run(global_step)])
            lr_val.append([sess.run(decayed_learning_rate)])
            #I see the train_step as implicitely executing sess.run(tf.add(global_step,1))
            sess.run(train_step,feed_dict={x:xdata[start_data:start_data+BATCH_SIZE],y:ydata[start_data:start_data+BATCH_SIZE]})
        
        
        plt.figure()
        plt.subplot(211)
        plt.plot(GLOBAL_s,lr_val,"-b")
        plt.title("Evolution of learning rate" )
        plt.subplot(212)
        plt.plot(GLOBAL_s,COSTS,".g")
        plt.title("Evolution of cost" )
        
        
        
        #Notice two things: first, global_step is actually being incremented, and second, the learning rate is actually being decayed
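
As a quick sanity check (just reusing the lists filled in the training loop above), you can look at the recorded learning rate around the staircase boundaries; with starter_learning_rate=0.001, LEARNING_RATE_DECAY_FACTOR=0.1 and decay_steps=5000, it should read 0.001 before step 5000 and then drop by a factor of 10 at steps 5000, 10000 and 15000:

        #Quick check of the staircase boundaries (run after the training loop above)
        print(lr_val[0], lr_val[4999], lr_val[5000], lr_val[10000], lr_val[15000])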
        

(You can obviously write MomentumOptimizer() instead of GradientDescentOptimizer()...)
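
For instance, a drop-in replacement for the train_step above could look like this (momentum=0.9 is just an illustrative value, not something taken from the question):

        #Same decayed learning rate, but fed to the momentum optimizer instead
        train_step = tf.train.MomentumOptimizer(decayed_learning_rate,
                                                momentum=0.9).minimize(L, global_step=global_step)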

Here are the two plots I get:

[Plot: evolution of the global_step parameter and of the decayed_learning_rate]

To sum it up in my head: when you call train_step, Tensorflow runs tf.add(global_step, 1).