Error when using the "predict" function for logistic regression

Time: 2021-04-02 12:24:49

Tags: python logistic-regression statsmodels

I am trying to fit a multinomial logistic regression and then predict the outcome from a sample.

$$ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt. $$

# RZS_TC is my DataFrame
# Binarize tree cover: values <= 50 become 0, values > 50 become 1
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover'] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover'] = 1
RZS_TC[['MAP', 'Sr', 'delTC', 'Mean_Treecover']].head()

[Output]:
                 MAP        Sr       delTC  Mean_Treecover
302993741   2159.297363 452.975647  2.666672    1.0
217364332   3242.351807 65.615341   8.000000    1.0
390863334   1617.215454 493.124054  5.666666    0.0
446559668   1095.183105 498.373383  -8.000000   0.0
246078364   2804.615234 98.981110   -4.000000   1.0
1000000 rows × 7 columns

#Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()

print(model.summary2())
[Output]:
                          Results: MNLogit
====================================================================
Model:                MNLogit          Pseudo R-squared: 0.364      
Dependent Variable:   Mean_Treecover   AIC:              831092.4595
Date:                 2021-04-02 13:51 BIC:              831139.7215
No. Observations:     1000000          Log-Likelihood:   -4.1554e+05
Df Model:             3                LL-Null:          -6.5347e+05
Df Residuals:         999996           LLR p-value:      0.0000     
Converged:            1.0000           Scale:            1.0000     
No. Iterations:       7.0000                                        
--------------------------------------------------------------------
Mean_Treecover = 0  Coef.  Std.Err.     t     P>|t|   [0.025  0.975]
--------------------------------------------------------------------
         Intercept -5.2200   0.0119 -438.4468 0.0000 -5.2434 -5.1967
               MAP  0.0023   0.0000  491.0859 0.0000  0.0023  0.0023
                Sr  0.0016   0.0000   90.6805 0.0000  0.0015  0.0016
             delTC -0.0093   0.0002  -39.9022 0.0000 -0.0098 -0.0089

However, whenever I try to predict with model.predict(), I get the following error:

prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627

Does anyone know how to solve this? Am I doing something wrong?

1 Answer:

Answer 0 (score: 1)

The model adds an intercept, so you need to include it. Using example data:

from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0, 1, (20, 4)),
                      columns=['MAP', 'Sr', 'delTC', 'Mean_Treecover'])

# Round the response to 0/1 so it is a binary outcome
RZS_TC['Mean_Treecover'] = RZS_TC['Mean_Treecover'].round()

model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()

You can see the dimensions of the fitted data (note the leading column of ones):

model.model.exog[:5,]
Out[16]: 
array([[1.        , 0.33914763, 0.79358056, 0.3103758 ],
       [1.        , 0.45915785, 0.94991271, 0.27203524],
       [1.        , 0.55527662, 0.15122108, 0.80675951],
       [1.        , 0.18493681, 0.89854583, 0.66760684],
       [1.        , 0.38300074, 0.6945397 , 0.28128137]])

This is the same as adding a constant:

import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])

    const       MAP        Sr     delTC
0     1.0  0.339148  0.793581  0.310376
1     1.0  0.459158  0.949913  0.272035
2     1.0  0.555277  0.151221  0.806760
3     1.0  0.184937  0.898546  0.667607
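To see why the constant column matters, here is a minimal NumPy-only sketch of the arithmetic behind a two-class prediction. It reuses the rounded coefficients and the first two data rows from the question, and it assumes that the coefficient block in the summary gives the log-odds of the second class against the reference class. The variable names are made up for illustration, and the probabilities are only approximate because the coefficients are rounded.

```python
import numpy as np

# Rounded coefficients from the question's summary,
# in the order Intercept, MAP, Sr, delTC (assumed interpretation:
# log-odds of the non-reference class)
params = np.array([-5.2200, 0.0023, 0.0016, -0.0093])

# First two rows of the question's data: MAP, Sr, delTC
X_raw = np.array([
    [2159.297363, 452.975647, 2.666672],
    [3242.351807, 65.615341, 8.000000],
])

# Prepend a column of ones, which is what sm.add_constant does
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Linear predictor and logistic transform
linear = X @ params
prob = 1.0 / (1.0 + np.exp(-linear))
print(prob)

# Without the constant column the shapes do not line up:
# three columns of data against four coefficients
try:
    X_raw @ params
    shape_error = None
except ValueError as exc:
    shape_error = str(exc)
print(shape_error)
```

The fitted parameter vector has four entries, so any array passed in for prediction needs four columns as well, with the ones first.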

If you have a DataFrame with the same column names, pass it directly and the formula adds the intercept for you:

prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])

Or, if you only need the fitted values, do this:

model.fittedvalues