How to implement stochastic gradient descent

Time: 2019-04-27 12:11:37

Tags: r

In stochastic gradient descent, we typically treat the objective function as a sum of a finite number of functions:

             f(x) = ∑ f_i(x),   i = 1, …, n

At each iteration, instead of computing the full gradient ∇f(x), stochastic gradient descent samples an index i uniformly at random and computes only ∇f_i(x).

The insight is that stochastic gradient descent uses ∇f_i(x) as an unbiased estimator of ∇f(x) (strictly, with uniform sampling E[∇f_i(x)] = ∇f(x)/n; the constant factor n is usually absorbed into the step size).

We then update x as x := x − η∇f_i(x), where η is the step size (learning rate).
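For my objective f(x) = sqrt(2+x) + sqrt(1+x) + sqrt(3+x), the components would be f_1(x) = sqrt(2+x), f_2(x) = sqrt(1+x), and f_3(x) = sqrt(3+x), with n = 3. In the abstract I understand the loop; a minimal sketch (grad_list, eta, and n_iter are names I made up for illustration):

sgd <- function(grad_list, x_init, eta = 0.1, n_iter = 100) {
  # grad_list: a list of functions, one gradient per component f_i
  x <- x_init
  for (k in seq_len(n_iter)) {
    i <- sample(length(grad_list), 1)   # sample i uniformly at random
    x <- x - eta * grad_list[[i]](x)    # x := x - eta * grad f_i(x)
  }
  return(x)
}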

I am finding this difficult to implement in R for my optimization problem. Here is my attempt:

stoc_grad<-function(){
  # set up a stepsize
  alpha = 0.1

  # set up a number of iteration
  iter = 30

  # define the objective function f(x) = sqrt(2+x)+sqrt(1+x)+sqrt(3+x)
  objFun = function(x) return(sqrt(2+x)+sqrt(1+x)+sqrt(3+x))

  # define the gradient of f(x) = sqrt(2+x)+sqrt(1+x)+sqrt(3+x)
  gradient_1 = function(x) return(1/2*sqrt(2+x))
  gradient_2 = function(x) return(1/2*sqrt(3+x))
  gradient_3 = function(x) return(1/2*sqrt(1+x))

  x = 1

  # create a vector to contain all xs for all steps
  x.All = numeric(iter)

  # gradient descent method to find the minimum
  for(i in seq_len(iter)){
    x = x - alpha*gradient_1(x)
    x = x - alpha*gradient_2(x)
    x = x - alpha*gradient_3(x)
    x.All[i] = x
    print(x)
  }

  # print result and plot all xs for every iteration
  print(paste("The minimum of f(x) is ", objFun(x), " at position x = ", x, sep = ""))
  plot(x.All, type = "l")  

}

Pseudo-code for the algorithm: Find pseudo-code here

Ultimately, I want to test this algorithm on standard optimization test functions, for example the three-hump camel function:

https://en.wikipedia.org/wiki/Test_functions_for_optimization
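For reference, the three-hump camel function is f(x, y) = 2x^2 − 1.05x^4 + x^6/6 + xy + y^2, with global minimum f(0, 0) = 0. A sketch of plain two-dimensional gradient descent on it (the start value and step size are arbitrary choices of mine):

# Three-hump camel function and its gradient
camel <- function(p) return(2*p[1]^2 - 1.05*p[1]^4 + p[1]^6/6 + p[1]*p[2] + p[2]^2)
camel_grad <- function(p) return(c(4*p[1] - 4.2*p[1]^3 + p[1]^5 + p[2],
                                   p[1] + 2*p[2]))

p <- c(0.5, -0.5)                 # arbitrary start value
for (k in seq_len(200)) {
  p <- p - 0.05 * camel_grad(p)   # fixed step size
}
print(p)                          # should end up near c(0, 0)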


1 Answer:

Answer 0 (score: 2):

There seem to be quite a few things tripping you up here. In order of importance, these are the two errors I have found so far:

  1. Stochastic gradient descent is used when you have a huge amount of data, because for such data evaluating the objective function over every training observation at each iteration is computationally expensive. That is not the kind of problem you are solving, although a genuinely stochastic variant is sketched after this list for comparison. Watch a short introduction here.
  2. When your parameter has restricted support, e.g. x ≥ −1 here (the domain of sqrt(1+x)), you will run into problems unless you guard against the propagation of NaNs.
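For comparison, a genuinely stochastic variant for your objective would sample one of the three component gradients uniformly at each step, keeping the same NaN guard as in the full implementation below (a sketch, not something I would recommend for a problem this small):

sgd_demo <- function(iter = 100, alpha = 0.1, x_init = 1) {
    objFun <- function(x) return(sqrt(2+x) + sqrt(1+x) + sqrt(3+x))
    # One gradient function per component f_i
    grads <- list(function(x) 1 / (2 * sqrt(2 + x)),
                  function(x) 1 / (2 * sqrt(1 + x)),
                  function(x) 1 / (2 * sqrt(3 + x)))
    x <- x_init
    for (i in seq_len(iter)) {
        g <- grads[[sample(3, 1)]]   # pick a component uniformly at random
        tmp <- x - alpha * g(x)
        # Guard against NaNs, as in the implementation below
        if ( !is.nan(suppressWarnings(objFun(tmp))) ) {
            x <- tmp
        }
    }
    return(x)
}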

Here is a gradient descent implementation that works for your problem (I have added code comments at the important changes):

# Having the number of iterations, step size, and start value be parameters the
# user can alter (with sane default values) I think is a better approach than
# hard coding them in the body of the function
grad <- function(iter = 30, alpha = 0.1, x_init = 1){

    # define the objective function f(x) = sqrt(2+x)+sqrt(1+x)+sqrt(3+x)
    objFun <- function(x) return(sqrt(2+x)+sqrt(1+x)+sqrt(3+x))

    # define the gradient of f(x) = sqrt(2+x)+sqrt(1+x)+sqrt(3+x)
    # Note we don't split up the gradient here
    gradient <- function(x) {
        result <- 1 / (2 * sqrt(2 + x))
        result <- result + 1 / (2 * sqrt(1 + x))
        result <- result + 1 / (2 * sqrt(3 + x))
        return(result)
    }

    x <- x_init

    # create a vector to contain all xs for all steps
    x.All <- numeric(iter)

    # gradient descent method to find the minimum
    for(i in seq_len(iter)){
        # Guard against NaNs
        tmp <- x - alpha * gradient(x)
        if ( !is.nan(suppressWarnings(objFun(tmp))) ) {
            x <- tmp
        }
        x.All[i] <- x
        print(x)
    }

    # print result and plot all xs for every iteration
    print(paste("The minimum of f(x) is ", objFun(x), " at position x = ", x, sep = ""))
    plot(x.All, type = "l")  

}

We know the analytic solution of this minimization problem: x = −1, since f is increasing on its domain x ≥ −1, so the minimum sits at the boundary. Let's see how this works:

grad()

[1] 0.9107771
[1] 0.8200156
[1] 0.7275966
...
[1] -0.9424109
[1] -0.9424109
[1] "The minimum of f(x) is 2.70279857718352 at position x = -0.942410938107257"

[line plot of x.All over the 30 iterations]

Note that the iterates stall at x ≈ −0.942 rather than reaching the boundary x = −1: the gradient term 1/(2*sqrt(1+x)) blows up near the boundary, so with a fixed step size any further step overshoots past −1, produces a NaN in the objective, and is rejected by the guard.