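Difference in linear regression using Statsmodels between the Patsy version and the dummy-list version

Date: 2019-02-18 16:33:57

Tags: python algorithm linear-regression statsmodels

Using the smf.ols and sm.OLS functions from statsmodels, I get different coefficient values and coefficient standard errors, even though mathematically both should be the same regression formula and give the same results.

I have prepared a 100% reproducible example of my problem. The dataframe df can be downloaded from here: https://drive.google.com/drive/folders/1i67wztkrAeEZH2tv2hyOlgxG7N80V3pI?usp=sharing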

Case 1: linear model using Patsy from Statsmodels

# First we load the libraries:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random
import pandas as pd
# We define a specific seed to have the same results:
random.seed(1234)
# Now we read the data that can be downloaded from Google Drive link provided above:
df = pd.read_csv("/Users/user/Documents/example/cars.csv", sep = "|")
# We create the linear regression:
lm1 = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
# We see the results:
lm1.fit().summary()

The results of lm1 are:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:19:14   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Case 2: linear model using dummy variables, also with Statsmodels

# We define a specific seed to have the same results:
random.seed(1234)
# First we check what `object` type variables we have in our dataset:
df.dtypes
# We create a list where we save the `object` type variables names:
object = ['make', 
          'fuel_system', 
          'engine_type', 
          'num_of_doors'
          ]
# Now we convert those object variables to numeric with get_dummies function to have 1 unique numeric dataframe:
df_num = pd.get_dummies(df, columns = object)
# We ensure the dataframe is numeric casting all values to float64:
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
# We define the predictive variables dataset:
X = df_num.drop('price', axis = 1)
# We define the response variable values:
y = df_num.price.values
# We add a constant as we did in the previous example (adding "+1" to Patsy):
Xc = sm.add_constant(X) # Adds a constant to the model
# We create the linear model and obtain results:
lm2 = sm.OLS(y, Xc)
lm2.fit().summary()

The results of lm2 are:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:28:16   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.205e+04   6811.094      1.769      0.079   -1398.490    2.55e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_alfa-romero   -2273.9631   1865.185     -1.219      0.225   -5956.669    1408.743
make_audi           4245.7414   1324.140      3.206      0.002    1631.299    6860.184
make_bmw            1.199e+04   1232.635      9.730      0.000    9559.555    1.44e+04
make_chevrolet     -2845.7867   1976.730     -1.440      0.152   -6748.733    1057.160
make_dodge         -3460.3061   1170.966     -2.955      0.004   -5772.315   -1148.297
make_honda           505.6865   2049.865      0.247      0.805   -3541.661    4553.034
make_isuzu           825.0045   1706.160      0.484      0.629   -2543.716    4193.725
make_jaguar         1.525e+04   1903.813      8.010      0.000    1.15e+04     1.9e+04
make_mazda         -1967.3063    982.179     -2.003      0.047   -3906.564     -28.048
make_mercedes-benz  1.471e+04   1423.004     10.338      0.000    1.19e+04    1.75e+04
make_mercury         684.1370   2913.361      0.235      0.815   -5068.136    6436.410
make_mitsubishi    -3462.7968   1221.018     -2.836      0.005   -5873.631   -1051.963
make_nissan        -3485.5094    946.316     -3.683      0.000   -5353.958   -1617.060
make_peugot          783.0586   3513.296      0.223      0.824   -6153.754    7719.871
make_plymouth      -3168.5552   1293.376     -2.450      0.015   -5722.256    -614.854
make_porsche        7284.9115   2853.174      2.553      0.012    1651.475    1.29e+04
make_renault       -4398.9354   2037.945     -2.159      0.032   -8422.747    -375.124
make_saab           1216.5702   1487.192      0.818      0.415   -1719.810    4152.950
make_subaru        -1.863e+04   3263.524     -5.710      0.000   -2.51e+04   -1.22e+04
make_toyota        -3044.9308    776.059     -3.924      0.000   -4577.218   -1512.644
make_volkswagen    -1867.0452   1170.975     -1.594      0.113   -4179.072     444.981
make_volvo          3159.7498   1327.405      2.380      0.018     538.862    5780.638
fuel_system_1bbl   -2790.4092   2230.161     -1.251      0.213   -7193.740    1612.922
fuel_system_2bbl    -648.2498   1094.525     -0.592      0.554   -2809.330    1512.830
fuel_system_4bbl   -2326.2983   3094.703     -0.752      0.453   -8436.621    3784.024
fuel_system_idi     1.712e+04   6154.806      2.782      0.006    4971.083    2.93e+04
fuel_system_mfi      926.1109   3063.134      0.302      0.763   -5121.881    6974.102
fuel_system_mpfi    1173.7017   1186.125      0.990      0.324   -1168.238    3515.642
fuel_system_spdi     449.5911   1827.318      0.246      0.806   -3158.349    4057.531
fuel_system_spfi   -1858.2133   3111.596     -0.597      0.551   -8001.891    4285.464
engine_type_dohc    2703.6445   1803.080      1.499      0.136    -856.440    6263.729
engine_type_dohcv  -9374.0342   3504.717     -2.675      0.008   -1.63e+04   -2454.161
engine_type_l      -2130.3416   3357.283     -0.635      0.527   -8759.115    4498.431
engine_type_ohc    -1335.2404   1454.047     -0.918      0.360   -4206.177    1535.696
engine_type_ohcf    1.232e+04   2850.883      4.322      0.000    6693.659     1.8e+04
engine_type_ohcv    5755.4074   1669.627      3.447      0.001    2458.820    9051.995
engine_type_rotor   4107.6373   3032.223      1.355      0.177   -1879.323    1.01e+04
num_of_doors_four   6234.8048   3491.722      1.786      0.076    -659.410    1.31e+04
num_of_doors_two    5814.8408   3337.588      1.742      0.083    -775.045    1.24e+04
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     1.01e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.38e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

As we can see, some variables, such as height, have the same coefficients in both outputs. However, others do not (for example the isuzu level of the make variable, among others). Shouldn't both outputs give the same result? What am I missing or doing wrong here?

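Thanks in advance for your help.

P.S. As @sukhbinder pointed out, even when using the Patsy formula without the independent term (adding "-1" to the formula, since Patsy includes the intercept by default) and removing the independent term from the dummy-variable version, the results still differ.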

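1 answer:

Answer 0: (score: 0)

The results do not match because the formula interface of Statsmodels (through Patsy) drops one level of each categorical predictor to avoid perfect multicollinearity with the intercept, whereas pd.get_dummies keeps a column for every level.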
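To make the collinearity concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and are not the cars dataset from the question. It compares the rank of the design matrix built from all dummy columns plus a constant with the rank of the one built from N-1 dummy columns per variable.

```python
import numpy as np
import pandas as pd

# Toy categorical data, invented for illustration only.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo", "bmw", "volvo"],
    "doors": ["two", "four", "four", "two", "two", "four"],
})

# All levels of every categorical variable, plus a constant column.
full = pd.get_dummies(toy, columns=["make", "doors"]).astype(float)
full.insert(0, "const", 1.0)

# N-1 levels per categorical variable (first level dropped), plus a constant.
reduced = pd.get_dummies(toy, columns=["make", "doors"], drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(full.shape[1], rank_full)        # more columns than the rank: singular design
print(reduced.shape[1], rank_reduced)  # columns equal to the rank: full column rank
```

In the full version the dummy columns of each variable sum to the constant column, so the coefficients are not uniquely determined, even though the fitted values and R-squared stay the same. This is consistent with the singular-design warning shown in the lm2 summary.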

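By going through the regression summary and identifying the variables that are missing from the formula version, exactly the same results can be obtained: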

deletex = [
        'make_alfa-romero',
        'fuel_system_1bbl',
        'engine_type_dohc',
        'num_of_doors_four'
        ]
df_num.drop( deletex, axis = 1, inplace = True) 
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
X = df_num.drop('price', axis = 1)
y = df_num.price.values
Xc = sm.add_constant(X) # Adds a constant to the model
random.seed(1234)
linear_regression = sm.OLS(y, Xc)
linear_regression.fit().summary()

Printed result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:16:08   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_audi           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make_bmw            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make_chevrolet      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make_dodge         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make_honda          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make_isuzu          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make_jaguar         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make_mazda           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make_mercedes-benz  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make_mercury        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make_mitsubishi    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make_nissan        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make_peugot         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make_plymouth       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make_porsche        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make_renault       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make_saab           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make_subaru        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make_toyota         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make_volkswagen      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make_volvo          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system_2bbl    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system_4bbl     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system_idi     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system_mfi     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system_mpfi    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system_spdi    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system_spfi     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type_dohcv  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type_l      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type_ohc    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type_ohcf    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type_ohcv    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type_rotor   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors_two    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

This result is exactly equal to the one from the first call, made with the Statsmodels formula interface:

random.seed(1234)
lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
lm_python.fit().summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:17:37   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

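It is necessary to check that the predictors correspond in both versions, because pd.get_dummies creates a dummy column for every level, whereas Statsmodels (through Patsy) uses N-1 levels for each categorical variable.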