Question

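I am trying to learn Python by rewriting the assignments from Andrew Ng's machine learning course, which were originally written in Octave (I took the class and got the certificate). I am having problems with the optimization functions. The course uses fmincg, an Octave function that minimizes the cost function for linear regression (a convex function) given its derivative. It also teaches gradient descent and the normal equation. Used correctly, all of these should in theory give the same result to within a few decimal places. They all worked well for linear regression, and I get the same results in Python.

To be clear, I am trying to minimize the cost function to find the best-fit parameters (theta) for a dataset. So far I have used 'nelder-mead', which does not require a derivative, and it gave me the solution closest to theirs. I have also tried 'TNC', 'CG', and 'BFGS', all of which use a derivative to minimize the function. They all work well with a first-order (linear) polynomial, but when I raise the order of the polynomial, in my case using x^1 through x^8, I can no longer get my function to fit the dataset.

The exercise is very simple: I have 12 data points, so an 8th-order polynomial should capture every point (if you are curious, it is a high-variance example, i.e. overfitting the data). The solution they show is a curve that passes through all the data points as expected. The best I got was with the 'nelder-mead' method, which captured only two points from the dataset; the other minimization functions did not give me anything close to what I am looking for. I am not sure what is wrong, because my cost function and gradient give the right values in the linear case (exactly the Octave answers), so I assume they are working correctly.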

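I will list the functions in Octave and in Python, hoping that someone can explain why I am getting different answers, or point out an obvious mistake that I am not seeing.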

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear 
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the 
%   cost of using theta as the parameter for linear regression to fit the 
%   data points in X and y. Returns the cost in J and the gradient in grad


m = length(y); % number of training examples 
J = 0;
grad = zeros(size(theta));

htheta = X * theta;
n = length(theta); % number of parameters, including the bias term
J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);

grad = 1 / m * X' * (htheta - y);
grad(2:n) = grad(2:n) + lambda / m * theta(2:n); % the bias term is left unregularized
grad = grad(:);

end
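
For reference, the cost and gradient computed by the Octave function above can be written out as equations. This is only a transcription of the code, using 1-based indexing so that \theta_1 is the unpenalized bias term; x^{(i)} denotes the i-th row of X, and m and n match the variables of the same name in the code.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2
          + \frac{\lambda}{2m} \sum_{j=2}^{n} \theta_j^2

\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right) x^{(i)}_1

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right) x^{(i)}_j
          + \frac{\lambda}{m} \theta_j, \qquad j = 2, \dots, n
```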

Below is a snippet of my code. If anyone would like the full code, I can provide that as well:

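import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def costFunction(theta, Xcost, y, lmda):
    # Regularized squared-error cost; the bias term theta[0] is not penalized.
    m = len(y)
    theta = theta.reshape((len(theta), 1))
    htheta = np.dot(Xcost, theta) - y  # residuals; y is assumed to have shape (m, 1)
    # 1.0 and 2.0 keep the divisions floating-point under Python 2 as well
    J = 1.0 / (2 * m) * np.dot(htheta.T, htheta) + lmda / (2.0 * m) * np.sum(theta[1:, :]**2)
    return J.item()  # return a plain scalar rather than a 1x1 array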

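def gradCostFunc(gradtheta, X, y, lmda):
    # Gradient of costFunction with respect to theta, returned as a 1-D array.
    m = len(y)
    gradtheta = gradtheta.reshape((len(gradtheta), 1))
    hgradtheta = np.dot(X, gradtheta) - y  # residuals; y is assumed to have shape (m, 1)

    # 1.0 keeps the division floating-point under Python 2 as well
    grad = (1.0 / m) * np.dot(X.T, hgradtheta)

    # Regularize every component except the bias term.
    grad[1:, 0] = grad[1:, 0] + (float(lmda) / m) * gradtheta[1:, 0]
    return grad.reshape((len(grad)))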

def normalEqn(X, y, lmda):
    e = np.eye(X.shape[1])
    e[0,0] = 0
    theta = np.dot(np.linalg.pinv(np.dot(X.T,X) + lmda * e),np.dot(X.T,y))
    return theta 

def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # Batch gradient descent; returns theta as a 1-D array.
    # Flatten theta first: gradCostFunc returns shape (n,), and subtracting that
    # from an (n, 1) column would broadcast to an (n, n) matrix.
    theta = np.asarray(theta, dtype=float).ravel()
    # J_history tracks the evolution of the cost function
    J_history = np.zeros((num_iters, 1))

    for i in range(num_iters):
        grad = gradCostFunc(theta, X, y, lmda)
        # update the parameters
        theta = theta - alpha * grad
        J_history[i] = costFunction(theta, X, y, lmda)

    plt.plot(J_history)
    plt.show()

    return theta

def trainLR(initheta, X, y, lmda):
    # Minimize the regularized cost with conjugate gradient, supplying the analytic gradient.
    options = {'maxiter': 1000}
    x0 = np.asarray(initheta, dtype=float).ravel()  # pass a 1-D starting point
    res = optimize.minimize(costFunction, x0, jac=gradCostFunc, method='CG',
                            args=(X, y, lmda), options=options)
    # Alternatives tried (note: fmin_bfgs returns the optimum directly, not a result object):
    #res = optimize.minimize(costFunction, x0, method='nelder-mead', args=(X, y, lmda), options={'disp': False})
    #res = optimize.fmin_bfgs(costFunction, x0, fprime=gradCostFunc, args=(X, y, lmda))
    return res.x

def polyFeatures(X, degree):
    # map the higher polynomials 
    out = X 
    if degree >= 2:
        for i in range(2,degree+1):
            out = np.column_stack((out,X**i))
        return out 
    else:
        return out

def featureNormalize(X):
    # Since the values will vary by orders of magnitudes 
    # It’s important to normalize the various features 
    mu = np.mean(X, axis=0)
    S1 = np.std(X, axis=0)
    return mu, S1, (X - mu)/S1

Here are the main calls to these functions:

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')
X_poly = X  # X itself is used again later in the program
p = 8 
X_poly = polyFeatures(X_poly, p)
mu, sigma, X_poly = featureNormalize(X_poly)
X_poly = padding(X_poly)
theta = np.zeros((X_poly.shape[1],1))
theta = trainLR(theta, X_poly, y, 0.)
#theta = normalEqn(X_poly, y, 0.)
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
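
One way to narrow down a discrepancy like this is to test the cost and gradient on a small problem where the answer is known independently. The sketch below is not the code from the question: `cost` and `grad` are hypothetical stand-ins written with 1-D arrays, the data is synthetic (the course file `ex5data1.mat` is not assumed to be available), and a cubic is used instead of the 8th-order fit. It compares the analytic gradient against finite differences with `scipy.optimize.check_grad`, then compares a BFGS solution with the least-squares solution for lambda = 0.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y, lmda):
    """Regularized squared-error cost; returns a scalar. theta and y are 1-D."""
    m = y.size
    r = X @ theta - y
    return (r @ r + lmda * (theta[1:] @ theta[1:])) / (2.0 * m)

def grad(theta, X, y, lmda):
    """Analytic gradient of cost; the bias term theta[0] is not penalized."""
    m = y.size
    g = X.T @ (X @ theta - y) / m
    g[1:] += lmda / m * theta[1:]
    return g

# Synthetic data (an assumption for illustration): 12 points, cubic features.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
X = np.column_stack([np.ones_like(x)] + [x**k for k in range(1, 4)])
y = rng.normal(size=12)

# 1) The analytic gradient should match finite differences at a random point.
grad_error = optimize.check_grad(cost, grad, rng.normal(size=X.shape[1]), X, y, 0.0)
print("gradient check error:", grad_error)

# 2) With lambda = 0, the minimizer should reach the least-squares cost.
theta0 = np.zeros(X.shape[1])
res = optimize.minimize(cost, theta0, jac=grad, args=(X, y, 0.0), method='BFGS')
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("cost (BFGS):         ", res.fun)
print("cost (least squares):", cost(theta_ls, X, y, 0.0))
```

If both checks pass on a small case but the full 8th-order fit still disagrees between methods, the remaining suspects are the inputs rather than the formulas: array shapes, the starting point, the iteration limit, and how the features were scaled.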

Answer 1

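My answer may not be quite what you are after, since your question asks for help debugging your current implementation.

That said, if you are interested in using an off-the-shelf optimizer in Python, take a look at OpenOpt. The library contains reasonably performant implementations of optimizers for a variety of optimization problems.

I should also mention that the scikit-learn library provides a good set of machine learning tools for Python.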
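
As a rough illustration of the scikit-learn route, the sketch below fits a regularized 8th-order polynomial using a pipeline. It is an assumption-laden example, not a port of the course exercise: the data is synthetic, `Ridge` stands in for the course's regularized linear regression (its `alpha` plays the role of lambda, and the intercept is not penalized), and the value `alpha=1.0` is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): 12 noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 12).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.normal(size=12)

model = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),  # x^1 through x^8
    StandardScaler(),                                  # feature normalization
    Ridge(alpha=1.0),                                  # L2-regularized linear fit
)
model.fit(x, y)
pred = model.predict(x)
print(pred.shape)
```

The pipeline stores the scaling parameters learned from the training data and reapplies them in `predict`, which removes the need to carry `mu` and `sigma` around by hand.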

Different Python minimization functions give different values. Why?
