Logistic regression in R: an optimization problem that depends on the initial guess

Date: 2018-09-10 23:14:36

Tags: r optimization logistic-regression

I need to implement logistic regression manually, using the Score/GMM approach, without using GLM. This is because the model will become far more complicated at a later stage. Currently I am running into a problem where, for the logistic regression, the optimization is very dependent on the initial point. To illustrate, here is my code using an online dataset. More details on the procedure are in the comments:

library(data.table)
library(nleqslv)
library(Matrix)

mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
data_analysis<-data.table(mydata)
data_analysis[,constant:=1]

#Negative log-likelihood for the logit model
#The logistic regression regresses the binary variable
#admit on a constant and the variable gpa

LL <- function(beta){
  beta=as.numeric(beta)
  data_temp=data_analysis
  mat_temp2 = cbind(data_temp$constant,
                    data_temp$gpa)
  one = rep(1,dim(mat_temp2)[1])
  h = exp(beta %*% t(mat_temp2))
  choice_prob = h/(1+h) 
  llf <- sum(data_temp$admit * log(choice_prob)) + (sum((one-data_temp$admit) * log(one-choice_prob)))
  return(-1*llf)
}
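
#Aside (not in the original question): for large |X %*% beta| the exp() above
#overflows (choice_prob becomes NaN) and log(one-choice_prob) can hit -Inf.
#A sketch of a numerically safer variant of the same negative log-likelihood,
#using plogis() on the log scale:

LL_stable <- function(beta){
  beta = as.numeric(beta)
  mat_temp2 = cbind(data_analysis$constant, data_analysis$gpa)
  eta = as.numeric(mat_temp2 %*% beta)
  llf = sum(data_analysis$admit * plogis(eta, log.p = TRUE) +
              (1 - data_analysis$admit) * plogis(-eta, log.p = TRUE))
  return(-1*llf)
}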

#Score to be used when optimizing using LL
#Identical to the Score function below, but returns the negative of its output

Score_LL <- function(beta){
  data_temp=data_analysis
  mat_temp2 = cbind(data_temp$constant,
                    data_temp$gpa)
  one = rep(1,dim(mat_temp2)[1])
  h = exp(beta %*% t(mat_temp2))
  choice_prob = h/(1+h) 
  resid = as.numeric(data_temp$admit - choice_prob)
  score_final2 =  t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
  return(-1*as.numeric(score_final2))
}

#The Score/Deriv/Jacobian of the Likelihood function

Score <- function(beta){
  data_temp=data_analysis
  mat_temp2 = cbind(data_temp$constant,
                    data_temp$gpa)
  one = rep(1,dim(mat_temp2)[1])
  h = exp(beta %*% t(mat_temp2))
  choice_prob = as.numeric(h/(1+h)) 
  resid = as.numeric(data_temp$admit - choice_prob)
  score_final2 =  t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
  return(as.numeric(score_final2))
}


#Derivative of the Score function

Score_Deriv <- function(beta){
  data_temp=data_analysis
  mat_temp2 = cbind(data_temp$constant,
                    data_temp$gpa)
  one = rep(1,dim(mat_temp2)[1])
  h = exp(beta %*% t(mat_temp2))
  weight = (h/(1+h)) * (1- (h/(1+h)))  
  weight_mat = Diagonal(length(weight), x=weight)
  deriv = t(mat_temp2)%*%weight_mat%*%mat_temp2
  return(-1*as.array(deriv))
}

#Quadratic Gain function
#Minimized at Score=0 and so minimizing is equivalent to solving the 
#FOC of the Likelihood. This is the GMM approach.

Quad_Gain<- function(beta){
  h=Score(as.numeric(beta))
  return(sum(h*h))
}

#Derivative of the Quadratic Gain function
Quad_Gain_deriv <- function(beta){
  return(2*t(Score_Deriv(beta))%*%Score(beta))
}

sol1=glm(admit ~ gpa, data = data_analysis, family = "binomial")
sol2=optim(c(2,2),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
sol3=optim(c(0,0),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")

When I run this code, I find that sol3 matches what glm produces (sol1), but sol2 (with a different initial point) differs substantially from the glm solution. The same thing happens with the real data in my main code. One solution is to create a grid and test multiple starting points. However, my main dataset has 10 parameters, which would make the grid very large and the program computationally infeasible. Is there a way around this problem?
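
For concreteness, the mismatch can be inspected by comparing the fitted coefficients, and the multiple-starting-points idea above can be approximated with a handful of random restarts rather than a full grid. A minimal sketch (the number of starts and the sampling range are arbitrary choices, not from the original post):

#Compare the three solutions
coef(sol1)   #glm estimates
sol2$par     #BFGS from (2,2): differs
sol3$par     #BFGS from (0,0): matches glm

#Random multi-start instead of a full grid
set.seed(1)
n_starts = 20
starts = matrix(runif(n_starts * 2, -4, 4), ncol = 2)
fits = lapply(seq_len(n_starts), function(i)
  optim(starts[i, ], Quad_Gain, gr = Quad_Gain_deriv, method = "BFGS"))
best = fits[[which.min(sapply(fits, function(f) f$value))]]
best$par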

1 answer:

Answer 0 (score: 0):

Your code seems unnecessarily complicated. The following two functions define the negative log-likelihood and the negative score vector for a logistic regression with the logit link:

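(The answer's code block did not survive in this copy; what follows is a minimal sketch of such a pair of functions, assuming a response vector y and a design matrix X, with hypothetical names negLogLik and negScore:)

negLogLik <- function(beta, y, X){
  eta <- as.numeric(X %*% beta)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

negScore <- function(beta, y, X){
  eta <- as.numeric(X %*% beta)
  -as.numeric(t(X) %*% (y - plogis(eta)))
}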

You can then use them as follows:

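(Again a sketch, using the hypothetical negLogLik/negScore above:)

X <- cbind(1, data_analysis$gpa)
y <- data_analysis$admit
fit <- optim(c(0, 0), negLogLik, gr = negScore, y = y, X = X, method = "BFGS")
fit$par  #should be close to coef(glm(admit ~ gpa, data = data_analysis, family = "binomial"))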

In general, when the covariates are well behaved (i.e., the coefficients are expected to lie roughly between -4 and 4), starting from 0 is a good idea.