Question

Please look at this pseudo-Python code. It is only for illustration.

import numpy as np

X = np.array([[1,0],[0,0]])
y = np.array([1,0])

w0 = np.random.random((2,1))

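learning_rate = 0.01

for i in range(100):
    # shuffle the data
    # take a part of the data (a mini-batch)

    # forward propagation (bias, matrix multiplication, activation)

    # compute the cost function

    # take the partial derivative of the cost with respect to w0 as delta

    # delta is assumed to already point downhill (e.g. built from y - output), hence +=
    w0 += X.T.dot(delta) * learning_rate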

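My biggest problem right now is understanding Adam and all the other optimizers that use momentum, and how they relate to convergence. I spent a few days rebuilding my knowledge from the math, but I am completely stuck.

My understanding is this: momentum helps us avoid getting stuck in local minima. It adds a force that pushes downhill, so I really don't see what convergence has to do with it. I found an implementation of Adam in Python.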
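One way to make the link between momentum and convergence concrete is to count steps on a simple problem. The sketch below is only an illustration under stated assumptions: it uses the quadratic x*x - 4*x + 4 (minimum at x = 2), the classic heavy-ball form of momentum, and a hypothetical helper `steps_to_converge` with an arbitrarily chosen tolerance. This function is convex, so there are no local minima to escape, and any difference in step count comes purely from how quickly each update rule converges.

```python
def grad_func(x):
    # gradient of x*x - 4*x + 4, which has its minimum at x = 2
    return 2 * x - 4


def steps_to_converge(beta, alpha=0.01, tol=1e-6, max_steps=100000):
    """Hypothetical helper: count updates until |gradient| < tol.

    beta = 0.0 gives plain gradient descent; beta > 0 adds momentum.
    """
    x = 0.0
    velocity = 0.0
    for step in range(1, max_steps + 1):
        # velocity is a decaying sum of past gradients
        velocity = beta * velocity + grad_func(x)
        x -= alpha * velocity
        if abs(grad_func(x)) < tol:
            return step
    return max_steps


plain_steps = steps_to_converge(beta=0.0)
momentum_steps = steps_to_converge(beta=0.9)
print("plain:", plain_steps, "momentum:", momentum_steps)
```

With these settings the momentum run should need noticeably fewer steps than the plain run, because the velocity keeps accumulating gradients that point in the same direction.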

Sagarvegad

https://github.com/sagarvegad/Adam-optimizer/blob/master/Adam.py

import math

# initialize the values of the parameters
alpha = 0.01
beta_1 = 0.9
beta_2 = 0.999
epsilon = 1e-8

def func(x):
    return x*x - 4*x + 4

def grad_func(x):                   # calculates the gradient
    return 2*x - 4

theta_0 = 0                         # initialize the vector
m_t = 0
v_t = 0
t = 0

while True:                         # loop until it has converged
    t += 1
    g_t = grad_func(theta_0)        # computes the gradient of the stochastic function
    m_t = beta_1*m_t + (1-beta_1)*g_t           # updates the moving average of the gradient
    v_t = beta_2*v_t + (1-beta_2)*(g_t*g_t)     # updates the moving average of the squared gradient
    m_cap = m_t/(1-(beta_1**t))     # calculates the bias-corrected estimate
    v_cap = v_t/(1-(beta_2**t))     # calculates the bias-corrected estimate
    theta_0_prev = theta_0
    theta_0 = theta_0 - (alpha*m_cap)/(math.sqrt(v_cap)+epsilon)    # updates the parameters
    if theta_0 == theta_0_prev:     # checks whether it has converged
        break

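Would someone be kind enough to describe what is happening here, starting from theta_0, and how I could implement this in my pseudocode? I would really appreciate the help, because this is frustrating me.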
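For reference, here is a minimal sketch of how the scalar loop above might carry over to the weight matrix w0. Several things in it are assumptions and not part of the original pseudocode: a single sigmoid layer without bias, a squared-error cost, full-batch updates (no shuffling), a fixed number of steps in place of the equality check, and the hypothetical names `sigmoid` and `train_adam`. The idea is that theta_0 becomes w0, g_t becomes the gradient of the cost with respect to w0, and m_t and v_t become arrays of the same shape as w0, so every weight gets its own running averages.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_adam(X, y, steps=2000, alpha=0.01, beta_1=0.9, beta_2=0.999,
               epsilon=1e-8, seed=0):
    """Hypothetical helper: fit one sigmoid layer with Adam updates."""
    rng = np.random.default_rng(seed)
    w0 = rng.random((X.shape[1], 1))   # plays the role of theta_0
    m = np.zeros_like(w0)              # moving average of the gradient
    v = np.zeros_like(w0)              # moving average of the squared gradient
    for t in range(1, steps + 1):
        out = sigmoid(X.dot(w0))       # forward pass (bias omitted for brevity)
        # gradient of 0.5 * sum((out - y)**2) with respect to w0
        grad = X.T.dot((out - y) * out * (1.0 - out))
        m = beta_1 * m + (1 - beta_1) * grad
        v = beta_2 * v + (1 - beta_2) * grad ** 2
        m_cap = m / (1 - beta_1 ** t)  # bias-corrected estimates
        v_cap = v / (1 - beta_2 ** t)
        # grad is the true gradient here, so the step is subtracted
        w0 -= alpha * m_cap / (np.sqrt(v_cap) + epsilon)
    return w0


X = np.array([[1.0, 0.0], [0.0, 0.0]])
y = np.array([[1.0], [0.0]])           # column vector so shapes match w0
w0 = train_adam(X, y)
print(w0)
```

Because the second column of X is all zeros, the gradient for the second weight is always zero and that weight never moves; only the first weight is trained in this toy data set.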

Let me add that I have read the official Adam publication and other resources, but I find them hard to understand.

Simple implementation of Adam in a NN

0 answers: