LinAlgError: Singular matrix when implementing logistic regression

Time: 2020-10-15 23:36:26

Tags: python pandas logistic-regression statsmodels

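I am trying to fit a logistic regression model, but I get an error when I try to print the results. I have searched for a solution and tried to work it out myself, but I have not been able to fix it.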

Here is the code, followed by the output:

#Columns
columns = new_df[['DIABETES_NO','DIABETES_INSULIN', 'DIABETES_NON-INSULIN', 'bmi_cat_0','bmi_cat_gte40','bmi_cat_lt40',
                 'albumin_cat_0', 'albumin_cat_gt3.5', 'albumin_cat_lt3.5', 'SMOKE_No', 'SMOKE_Yes',
                 'age_cat_0', 'age_cat_gte65', 'age_cat_lt65', 'SEX_male', 'SEX_female']]
#Model 1 Target Variable (Mortality)

X = columns
y = new_df['Mortality']

logit_model=sm.Logit (y,X)
result=logit_model.fit()
print(result.summary2())
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.014645
         Iterations: 35
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-35-0a3dafc9126f> in <module>
      5 
      6 logit_model=sm.Logit (y,X)
----> 7 result=logit_model.fit()
      8 print(result.summary2())

E:\Users\davidwool\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1832         bnryfit = super(Logit, self).fit(start_params=start_params,
   1833                 method=method, maxiter=maxiter, full_output=full_output,
-> 1834                 disp=disp, callback=callback, **kwargs)
   1835 
   1836         discretefit = LogitResults(self, bnryfit)

E:\Users\davidwool\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    218         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    219                 method=method, maxiter=maxiter, full_output=full_output,
--> 220                 disp=disp, callback=callback, **kwargs)
    221 
    222         return mlefit # up to subclasses to wrap results

E:\Users\davidwool\Anaconda3\lib\site-packages\statsmodels\base\model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    471             Hinv = cov_params_func(self, xopt, retvals)
    472         elif method == 'newton' and full_output:
--> 473             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    474         elif not skip_hessian:
    475             H = -1 * self.hessian(xopt)

E:\Users\davidwool\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in inv(a)
    530     signature = 'D->D' if isComplexType(t) else 'd->d'
    531     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 532     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    533     return wrap(ainv.astype(result_t, copy=False))
    534 

E:\Users\davidwool\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_singular(err, flag)
     87 
     88 def _raise_linalgerror_singular(err, flag):
---> 89     raise LinAlgError("Singular matrix")
     90 
     91 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

I tried setting method='bfgs'. The fit then runs, but every column except Coef. shows NaN.

Here is that attempt, followed by the output:

#Model 1 Target Variable (Mortality)

X = columns
y = new_df['Mortality']

logit_model=sm.Logit (y,X)
result=logit_model.fit(method='bfgs')
print(result.summary2())
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.014671
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.090     
Dependent Variable: Mortality        AIC:              329.5189  
Date:               2020-10-15 19:32 BIC:              402.1568  
No. Observations:   10549            Log-Likelihood:   -154.76   
Df Model:           9                LL-Null:          -170.03   
Df Residuals:       10539            LLR p-value:      0.00035468
Converged:          0.0000           Scale:            1.0000    
-----------------------------------------------------------------
                          Coef.  Std.Err.  z  P>|z| [0.025 0.975]
-----------------------------------------------------------------
DIABETES_NO              -1.3211      nan nan   nan    nan    nan
DIABETES_INSULIN         -0.1911      nan nan   nan    nan    nan
DIABETES_NON-INSULIN     -0.2797      nan nan   nan    nan    nan
bmi_cat_0                -0.0321      nan nan   nan    nan    nan
bmi_cat_gte40            -1.0971      nan nan   nan    nan    nan
bmi_cat_lt40             -0.6626      nan nan   nan    nan    nan
albumin_cat_0            -1.7288      nan nan   nan    nan    nan
albumin_cat_gt3.5        -0.7371      nan nan   nan    nan    nan
albumin_cat_lt3.5         0.6740      nan nan   nan    nan    nan
SMOKE_No                 -1.0509      nan nan   nan    nan    nan
SMOKE_Yes                -0.7410      nan nan   nan    nan    nan
age_cat_0                -0.0321      nan nan   nan    nan    nan
age_cat_gte65            -0.0337      nan nan   nan    nan    nan
age_cat_lt65             -1.7261      nan nan   nan    nan    nan
SEX_male                 -1.2519      nan nan   nan    nan    nan
SEX_female               -0.5400      nan nan   nan    nan    nan
=================================================================

Any help or suggestions would be greatly appreciated, thanks!

1 Answer:

Answer 0: (score: 0)

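Clearly, the categorical variables are not being handled correctly. For every categorical variable you include the complete set of one-hot encoded dummy variables (for example, both SEX_male and SEX_female). The dummies of one variable sum to 1 in every row, so each complete set acts like a constant. The regression therefore contains several constants that are perfectly collinear with one another, and that is what produces the singular matrix error.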
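The point can be checked without fitting anything. The sketch below is only an illustration under stated assumptions: the data are synthetic (a stand-in for new_df, which is not available here), only three of the question's variables are mimicked, and the names raw, X_full and X_reduced are made up for this example. It compares the rank of the full one-hot design matrix with the rank of a design that drops one reference level per variable and adds a single intercept column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data (hypothetical values, not new_df).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "SEX": rng.choice(["male", "female"], size=n),
    "SMOKE": rng.choice(["No", "Yes"], size=n),
    "age_cat": rng.choice(["0", "gte65", "lt65"], size=n),
})

# Full one-hot encoding, as in the question: the dummies of each
# variable sum to 1 in every row, so the columns are linearly dependent.
X_full = pd.get_dummies(raw, dtype=float)
rank_full = np.linalg.matrix_rank(X_full.to_numpy())

# Drop one reference level per variable, then add one intercept column.
X_reduced = pd.get_dummies(raw, drop_first=True, dtype=float)
X_reduced.insert(0, "const", 1.0)
rank_reduced = np.linalg.matrix_rank(X_reduced.to_numpy())

print("full one-hot:", X_full.shape[1], "columns, rank", rank_full)
print("reduced:     ", X_reduced.shape[1], "columns, rank", rank_reduced)
```

A rank lower than the number of columns means the Hessian cannot be inverted, which matches the traceback in the question and is probably also why the bfgs fit reports NaN standard errors. A design built like X_reduced can be passed to sm.Logit(y, X_reduced) in place of the original columns. Which level is dropped only changes the reference category against which the remaining coefficients are interpreted.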