Since this post is quite long, I will put my questions up front:
1. Is there a package in Python that gives the maximum likelihood estimates of the parameters, for a given number of parameters p, covariates x and data values y? (Ideally with comprehensive documentation on how to use it.)
2. Does this method scale, i.e. will it work if I try O(100) covariates and a similar number of data samples?
Background:
I am trying to use Python to study things like the distribution of maximum likelihood estimators as the number of samples n / covariates p varies. My script generates the data for logistic regression just fine, but I have not been able to get any method of parameter estimation (i.e. the parameter values that maximise the log-likelihood) to work correctly.
Approaches I have tried:
- Writing my own version of the Newton-Raphson procedure. But the errors in my estimates diverged over repeated iterations (and yes, I checked for the obvious sign and inequality mistakes!).
- Using a Newton conjugate gradient implementation. This also failed, with the error "the truth value of an array is ambiguous". I could work around this with all() when writing my own version, but not when using the package. Strange, since I thought Newton-CG was meant to handle the multivariate case! (A minimal sketch of what I mean is included further down, just before the optional code.)
- Finally, I wanted to use the package statsmodels. I don't know if I am being really obtuse, but I cannot find any comprehensive documentation for it? The best I have found for obtaining logistic regression parameters is this, e.g.,
X,y = logit_data(np.power(10,6),p,theta)
y=np.reshape(y, (len(y),))
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)
thetaEst = clf.get_params(X, y)
I have also tried the last line as:
thetaEst = clf.get_params()
but nothing seems to give me the estimated parameter values. I get either errors or objects I wasn't expecting. Surely there is a Python package that should do this?!? Surely I don't have to resort to R (I don't know R D:!!!)
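For concreteness, here is a minimal, self-contained sketch of the kind of scipy call I had in mind for the Newton conjugate gradient attempt. This is my own guess at the correct setup rather than a verified fix: my understanding is that scipy.optimize.minimize with method='Newton-CG' wants the objective to return a plain scalar and the gradient a flat 1-D array, and that passing (n,1)-shaped arrays around is a common source of the "truth value of an array is ambiguous" error. The function names match my full script below, but the bodies are rewritten accordingly, and the data here are just a toy example so the sketch runs on its own.
import numpy as np
from scipy.optimize import minimize

def g(z):
    return 1/(1+np.exp(-z))

def negloglikelihood(theta, X, y):
    #negative log-likelihood, forced to a plain float so the optimiser receives a scalar
    h = g(X @ theta)
    return -float(y @ np.log(h) + (1-y) @ np.log(1-h))

def gradf(theta, X, y):
    #gradient of the negative log-likelihood, returned as a flat (p+1,) array
    h = g(X @ theta)
    return -(X.T @ (y - h))

#toy data, only to make the sketch self-contained (same style as logit_data below)
rng = np.random.default_rng(0)
n, p = 1000, 5
X = np.column_stack([np.ones(n), rng.normal(0, 1.5, (n, p))])
theta_true = np.array([1, 1.2, 1.1, 0.8, 0.9, 1.3])
y = (rng.uniform(size=n) < g(X @ theta_true)).astype(float)

res = minimize(negloglikelihood, x0=np.zeros(p+1), args=(X, y),
               jac=gradf, method='Newton-CG')
print(res.x)  #estimated theta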
Optional code
I did not want to make this post any longer, but I am sure the code will be asked for. So here is my code implementing Newton-Raphson, the function I have been trying to replace with the existing packages:
#Script for producing y data from p covariates drawn from a specified distribution, with specified theta parameters,
#and n data samples, for the logit link function.
import numpy as np
import matplotlib.pyplot as plt
#Define link function here
def g(z):
    g=1/(1+np.exp(-z))
    return g
#For producing y data values given true parameters theta and number of covariates
def logit_data(n,p, theta):
    #Define parameters
    #1)Number of covariates
    p_i = p+1 #with intercept
    p_i=np.int(p_i)
    #2) n as correct data type
    n=np.int(n)
    #4)Specify parameter values to be estimated
    theta=np.reshape(theta, (p_i,1))
    #5)Define distribution from which covariate values are drawn i.i.d., and initiate data values
    X=np.zeros((n,p_i))
    X[:,0]=1 #intercept
    mean=0
    sigma=1.5
    X[:,1:]=np.random.normal(mean,sigma,(n,p))
    #6)Produce y values treating y as a Bernoulli variable with p=g(X*theta)
    r=np.random.uniform(0,1,n)
    r=np.reshape(r, (len(r),1))
    htrue=g(X.dot(theta))
    y=htrue-r
    y[y>=0]=1
    y[y<0]=0
    return X, y
#Newton Raphson implementation
def NewtonRaphson(X,y):
    ##NOTE: All functions negloglikelihood, gradf, hessian return the values for f = (-ve of the log likelihood function),
    #to use the NR method to minimise f (rather than maximise l)
    #Define log likelihood function to be maximised
    def negloglikelihood(y,h):
        l= y.transpose() @ np.log(h) + (1-y).transpose() @ np.log(1-h)
        f=-l
        return f
    #Define gradient of log likelihood function
    def gradf(y, h, X):
        a=(y-h).transpose()
        gradl= np.matmul(a,X)
        grad_f=-gradl
        return grad_f
    #Define second derivative (Hessian) of log likelihood function
    def hessian(h, X):
        D=np.identity(len(h))*(np.matmul(h,(1-h).transpose()))
        H=-X.transpose() @ D @ X
        Hf=-H
        return Hf
    #Minimise f=-l
    #Produce initial theta estimate and probability parameter h
    np.random.seed(555)
    thetaEst=np.random.normal(1.1, 0.4, 6)
    eta=np.matmul(X,thetaEst)
    h=g(eta)
    #While not at a minimum of f
    #control constants
    a=10e-8
    b=10e-8
    i=0
    j=0
    k=0
    while not (np.linalg.norm(gradf(y,h,X)) < np.absolute(negloglikelihood(y,h)) * a + b):
        i=i+1
        #print('i = %s' %i)
        h=g(np.matmul(X,thetaEst))
        H=hessian(h,X) #Cholesky decomposition to check the hessian (of f) is positive semi-definite
        # print(H)
        try:
            np.linalg.cholesky(H)
            #print('j = %s' %j)
            j=j+1
        except np.linalg.LinAlgError:
            print('Hessian not positive semi-definite!')
            try:
                v,w=np.linalg.eig(H)
                # print(v,w)
                v=np.absolute(v)
                H=w @ np.diag(v) @ np.linalg.inv(w)
            except:
                return thetaEst
        delta = 0
        try:
            delta=np.linalg.solve(H, np.reshape(gradf(y,h,X),(6,1))) #Solve for incremental theta step
            #print('k = %s' %k)
            k=k+1
        except:
            return thetaEst #Simply return theta estimate if have singular hessian
        while negloglikelihood(y, h) > negloglikelihood(y, g(np.matmul(X,thetaEst+delta))):
            print('!!')
            delta=0.5*delta #Ensure added step is sufficiently small so as not to diverge
        thetaEst=thetaEst+delta
    return thetaEst
#Main control
#1)Sample numbers to test for errors in beta, as powers of 10.
npowers=np.arange(1,2,0.05)
n=np.power(10,npowers)
#2)Number of independent covariates
p=5
#3)True theta to be estimated (parameter values)
theta=np.asarray([1,1.2,1.1,0.8,0.9,1.3])
#4)#Initiate arrays to store estimates of theta (and errors) computed at specified sample numbers N
Thetas=np.zeros((len(npowers),p+1))
Errors=np.zeros((len(npowers),p+1))
#5)Obtain random covariate values from specified distribution, and corresponding y values using true theta
#plus gaussian noise term.
X,y = logit_data(n[-1],p,theta)
#6)Calculate cumulative means for given n values, for the theta estimates
for ind,N in enumerate(n):
    N=np.int(N)
    thetaTemp=NewtonRaphson(X[0:N,:],y[0:N])
    Thetas[ind,:] = np.reshape(thetaTemp,6)
#7)Calculate true errors
#print(Thetas)
Errors=Thetas-theta.transpose()
absError=np.abs(Errors)
nerror=Errors*np.sqrt(n)[:,np.newaxis]
logerror = np.log10(absError)
#8)Save data as csv
#9)Plots
plt.scatter(X@theta, g(X@theta))
plt.scatter(X@theta,y)
plt.show()
fig=plt.figure()
for i in range(p+1):
    plt.plot(npowers, logerror[:,i])
fig.suptitle('log10(Absolute Error) in MLE against log10(Number of samples,N) for logistic regression')
plt.xlabel('log_10(N)')
plt.ylabel('log_10(Absolute Error)')
fig.savefig('logiterrors7.png')
plt.show()
Answer (score: 1)
Documentation for the logistic regression model in statsmodels (latest development version) can be found here. All the models follow a familiar series of steps, so this should give you enough information to use it in practice (be sure to look through some of the examples, e.g. here). In general I would not advise re-implementing solvers/models that are already available in scipy or statsmodels, unless you have a very specific need.
Now, I have used your script to generate some data and estimated the parameters with the Logit model, as follows,
import statsmodels.api as sm
X, y = logit_data(n[-1], p, theta)
model = sm.Logit(y, X)
result = model.fit()
print(result.summary())
This outputs (your mileage may vary)
Optimization terminated successfully.
Current function value: 0.203609
Iterations 9
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   89
Model:                          Logit   Df Residuals:                       83
Method:                           MLE   Df Model:                            5
Date:                Wed, 17 Jul 2019   Pseudo R-squ.:                  0.7062
Time:                        13:40:37   Log-Likelihood:                -18.121
converged:                       True   LL-Null:                       -61.684
                                        LLR p-value:                 2.695e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0735      0.540      1.986      0.047       0.014       2.133
x1             2.0890      0.594      3.518      0.000       0.925       3.253
x2             1.7191      0.459      3.746      0.000       0.820       2.618
x3             1.7228      0.464      3.713      0.000       0.813       2.632
x4             1.1897      0.410      2.902      0.004       0.386       1.993
x5             2.2008      0.653      3.370      0.001       0.921       3.481
==============================================================================

Possibly complete quasi-separation: A fraction 0.10 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
These coefficients can be accessed via result.params, as follows,
>>>result.params
[1.0734945 2.08898192 1.71907914 1.72278748 1.18972079 2.20079805]
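On your second question (scaling to O(100) covariates and a comparable number of samples): that is a routine problem size for both statsmodels and scikit-learn. If the default Newton iterations become slow or unstable as the number of parameters grows, Logit.fit accepts alternative optimizers via its method argument (e.g. 'bfgs' or 'lbfgs'). And since you are interested in the distribution of the MLE, the results object also exposes the standard errors, the estimated covariance matrix of the parameters and confidence intervals. A rough sketch, reusing the sm, X and y from above (these are the attribute names as I know them; check the statsmodels docs for your installed version):
result = sm.Logit(y, X).fit(method='lbfgs', maxiter=200)
params = result.params            # point estimates (the MLE)
std_err = result.bse              # standard errors of the estimates
cov = result.cov_params()         # estimated covariance matrix of the estimates
ci = result.conf_int(alpha=0.05)  # 95% confidence intervals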