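Weekday as a factor variable in a linear regression model using statsmodels

Time: 2018-05-03 13:11:26

Tags: python regression statsmodels dummy-variable

Question:

How can I add a factor variable to a model when using sm.OLS()?

Details

Below is a reproducible data frame. Select and copy it with Ctrl+C, then run the code snippet further down to get a reproducible example.

Input data: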

Date    A   B   weekday
2013-05-04  25.03   88.51   Saturday
2013-05-05  52.98   67.99   Sunday
2013-05-06  39.93   75.19   Monday
2013-05-07  47.31   86.99   Tuesday
2013-05-08  19.61   87.94   Wednesday
2013-05-09  39.51   83.10   Thursday
2013-05-10  21.22   62.16   Friday
2013-05-11  19.04   58.79   Saturday
2013-05-12  18.53   75.27   Sunday
2013-05-13  11.90   75.43   Monday
2013-05-14  47.64   64.76   Tuesday
2013-05-15  27.47   91.65   Wednesday
2013-05-16  11.20   59.83   Thursday
2013-05-17  25.10   67.47   Friday
2013-05-18  19.89   64.70   Saturday
2013-05-19  38.91   76.68   Sunday
2013-05-20  42.11   94.36   Monday
2013-05-21  7.845   73.67   Tuesday
2013-05-22  35.45   76.67   Wednesday
2013-05-23  29.43   79.05   Thursday
2013-05-24  33.51   78.53   Friday
2013-05-25  13.58   59.26   Saturday
2013-05-26  37.38   68.59   Sunday
2013-05-27  37.09   67.79   Monday
2013-05-28  21.70   70.54   Tuesday
2013-05-29  11.85   60.00   Wednesday

Code for the regression analysis using statsmodels:

The following uses sm.OLS() to fit a linear regression of A on B, with a constant term added via sm.add_constant():

import pandas as pd
import statsmodels.api as sm

df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

Output (shortened):

                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.4328     17.355     -0.083      0.935       -37.252    34.386
B              0.4034      0.233      1.729      0.097        -0.078     0.885
==============================================================================

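Now I would like to add weekday as an explanatory factor variable. I was hoping it would be as simple as changing the data type of the column in the data frame, but unfortunately that does not seem to work, even though the column is accepted by the x = sm.add_constant(independent) step.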

import pandas as pd
import statsmodels.api as sm

df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)

independent = df[['B', 'weekday']]
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

When execution reaches the model = sm.OLS(df['A'], x).fit() line, a ValueError is raised:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any other suggestions?

1 answer:

Answer 0: (score: 1)

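You can use pandas categoricals to create the dummy variables, or, more simply, use the formula interface, where patsy converts all non-numeric columns into dummy variables or another factor encoding.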

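Using the formula interface in this case (the same as the lowercase ols in statsmodels.formula.api) gives the following result. Patsy sorts the levels of a categorical variable alphabetically. 'Friday' is missing from the list of variables because it has been selected as the reference category.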

>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     1.105
Date:                Thu, 03 May 2018   Prob (F-statistic):              0.401
Time:                        15:26:02   Log-Likelihood:                -97.898
No. Observations:                  26   AIC:                             211.8
Df Residuals:                      18   BIC:                             221.9
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.4717     19.343     -0.076      0.940     -42.110      39.167
weekday[T.Monday]        2.5837      9.857      0.262      0.796     -18.124      23.291
weekday[T.Saturday]     -6.5889      9.599     -0.686      0.501     -26.755      13.577
weekday[T.Sunday]        9.2287      9.616      0.960      0.350     -10.975      29.432
weekday[T.Thursday]     -1.7610     10.321     -0.171      0.866     -23.445      19.923
weekday[T.Tuesday]       2.6507      9.664      0.274      0.787     -17.652      22.953
weekday[T.Wednesday]    -6.9320      9.911     -0.699      0.493     -27.754      13.890
B                        0.4047      0.258      1.566      0.135      -0.138       0.948
==============================================================================
Omnibus:                        1.039   Durbin-Watson:                   2.313
Prob(Omnibus):                  0.595   Jarque-Bera (JB):                0.532
Skew:                          -0.350   Prob(JB):                        0.766
Kurtosis:                       3.007   Cond. No.                         638.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For the available categorical coding options, see the patsy documentation: http://patsy.readthedocs.io/en/latest/categorical-coding.html

For example, the reference level can be specified explicitly, as in this formula:

"A ~ B + C(weekday, Treatment('Sunday'))"

http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
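
A short sketch of that formula in use, again on only the first 14 rows of the question's data (an assumption for brevity). It also fits the default coding for comparison: changing the reference level reparameterizes the weekday effects, so the individual coefficients change but the fitted values should not.

```python
import io

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# First 14 rows of the question's data (subset for illustration only).
raw = """Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col="Date")

# Default coding: the alphabetically first level (Friday) is the reference.
res_default = smf.ols("A ~ B + weekday", data=df).fit()

# Explicit reference level: Sunday becomes the baseline instead.
res_sunday = smf.ols("A ~ B + C(weekday, Treatment('Sunday'))", data=df).fit()

# Friday now has its own dummy and Sunday does not.
print(res_sunday.params)

# Same model, different parameterization: the fitted values agree.
same_fit = np.allclose(res_default.fittedvalues, res_sunday.fittedvalues)
print(same_fit)
```

Each weekday coefficient in the second fit is then read as the difference from Sunday rather than from Friday.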