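Question:
How to use sm.OLS()?
Details
Below is a reproducible data frame. Select it and copy it with Ctrl + C, then run the code snippet further down to get a reproducible example.
Input data: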
Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
2013-05-18 19.89 64.70 Saturday
2013-05-19 38.91 76.68 Sunday
2013-05-20 42.11 94.36 Monday
2013-05-21 7.845 73.67 Tuesday
2013-05-22 35.45 76.67 Wednesday
2013-05-23 29.43 79.05 Thursday
2013-05-24 33.51 78.53 Friday
2013-05-25 13.58 59.26 Saturday
2013-05-26 37.38 68.59 Sunday
2013-05-27 37.09 67.79 Monday
2013-05-28 21.70 70.54 Tuesday
2013-05-29 11.85 60.00 Wednesday
Code for the regression analysis using statsmodels:
The following uses sm.OLS() to fit a linear regression of A on B (including a constant term added with sm.add_constant()):
import pandas as pd
import statsmodels.api as sm
df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
Output (shortened):
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -1.4328 17.355 -0.083 0.935 -37.252 34.386
B 0.4034 0.233 1.729 0.097 -0.078 0.885
==============================================================================
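Now I would like to add weekday as an explanatory factor variable. I hoped it would be as simple as changing the data type in the data frame, but sadly that does not seem to work, even though the column is accepted by the x = sm.add_constant(independent) part.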
import pandas as pd
import statsmodels.api as sm
df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df[['B', 'weekday']]
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
When you reach the model = sm.OLS(df['A'], x).fit() part, a ValueError is raised:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Any other suggestions?
Answer 0: (score: 1)
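You can use a pandas categorical to create the dummy variables, or, more simply, use the formula interface, where patsy converts all non-numeric columns to dummy variables or another factor encoding.
Using the formula interface in this case (OLS.from_formula is the same as the lowercase ols in statsmodels.formula.api) gives the result below.
Patsy sorts the levels of a categorical variable alphabetically. 'Friday' is missing from the list of variables because it has been chosen as the reference category.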
>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.301
Model: OLS Adj. R-squared: 0.029
Method: Least Squares F-statistic: 1.105
Date: Thu, 03 May 2018 Prob (F-statistic): 0.401
Time: 15:26:02 Log-Likelihood: -97.898
No. Observations: 26 AIC: 211.8
Df Residuals: 18 BIC: 221.9
Df Model: 7
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept -1.4717 19.343 -0.076 0.940 -42.110 39.167
weekday[T.Monday] 2.5837 9.857 0.262 0.796 -18.124 23.291
weekday[T.Saturday] -6.5889 9.599 -0.686 0.501 -26.755 13.577
weekday[T.Sunday] 9.2287 9.616 0.960 0.350 -10.975 29.432
weekday[T.Thursday] -1.7610 10.321 -0.171 0.866 -23.445 19.923
weekday[T.Tuesday] 2.6507 9.664 0.274 0.787 -17.652 22.953
weekday[T.Wednesday] -6.9320 9.911 -0.699 0.493 -27.754 13.890
B 0.4047 0.258 1.566 0.135 -0.138 0.948
==============================================================================
Omnibus: 1.039 Durbin-Watson: 2.313
Prob(Omnibus): 0.595 Jarque-Bera (JB): 0.532
Skew: -0.350 Prob(JB): 0.766
Kurtosis: 3.007 Cond. No. 638.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
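For the categorical coding options, see the patsy documentation:
http://patsy.readthedocs.io/en/latest/categorical-coding.html
For example, the reference category can be specified explicitly, as in this formula: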
"A ~ B + C(weekday, Treatment('Sunday'))"
http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
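As a rough sketch of how that formula could be used, the snippet below fits the model with Sunday as the reference level and lists the resulting parameter names. The embedded data string (raw) is an assumption made so the example runs on its own, and it assumes a statsmodels version whose formula interface is backed by patsy.

```python
import io

import pandas as pd
import statsmodels.formula.api as smf

# The question's data, embedded so the sketch does not depend on the clipboard.
raw = """Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
2013-05-18 19.89 64.70 Saturday
2013-05-19 38.91 76.68 Sunday
2013-05-20 42.11 94.36 Monday
2013-05-21 7.845 73.67 Tuesday
2013-05-22 35.45 76.67 Wednesday
2013-05-23 29.43 79.05 Thursday
2013-05-24 33.51 78.53 Friday
2013-05-25 13.58 59.26 Saturday
2013-05-26 37.38 68.59 Sunday
2013-05-27 37.09 67.79 Monday
2013-05-28 21.70 70.54 Tuesday
2013-05-29 11.85 60.00 Wednesday
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+").set_index("Date")

# C(...) marks weekday as categorical; Treatment('Sunday') makes Sunday the
# reference level instead of the alphabetically first one (Friday).
formula = "A ~ B + C(weekday, Treatment('Sunday'))"
res = smf.ols(formula, data=df).fit()

# Sunday is absorbed into the intercept, so it has no coefficient of its own;
# Friday now appears among the dummy terms.
for name in res.params.index:
    print(name)
```

Changing the reference level only changes what the weekday coefficients are measured against. The fitted values and the coefficient on B are the same as with the default coding.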