I'd like to loop through several specifications of a linear regression and save the results for each model in a python dictionary. The code below is somewhat successful but additional text (e.g. datatype information) is included in the dictionary making it unreadable. Moreover, regarding the confidence interval, I'd like to have two separate columns - one for the upper and another for the lower-bound - but I'm unable to do that.
code:
import patsy
import statsmodels.api as sm
from collections import defaultdict
colleges = ['ARC_g',u'CCSF_g',u'DAC_g',u'DVC_g',u'LC_g',u'NVC_g',u'SAC_g', u'SRJC_g',u'SC_g',u'SCC_g']
results = defaultdict(lambda: defaultdict(int))
for exog in colleges:
exog = exog.encode('ascii')
f1 = 'GRADE_PT_103 ~ %s -1' % exog
y,X = patsy.dmatrices(f1, data,return_type='dataframe')
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
results[exog]['beta'] = res.params
#I'd like the confidence interval to be separated into two columns ('upper' and 'lower')
results[exog]['CI'] = res.conf_int()
results[exog]['rsq'] = res.rsquared
pd.DataFrame(results)
______Current output
ARC_g | CCSF_g | ...
beta | ARC_g 0.79304 dtype: float64 | CCSF_g 0.833644 dtype: float64
CI | 0 1 ARC_g 0.557422 1.0... 0 1| CCSF_g 0.655746 1...
rsq | 0.122551 | 0.213053
答案 0 :(得分:2)
这就是我总结你所展示的内容的方式。希望它能帮助你提供一些想法。
import pandas as pd
import statsmodels.formula.api as smf
data = pd.DataFrame(np.random.randn(30, 5), columns=list('YABCD'))
results = {}
for c in data.columns[1:]:
f = 'Y ~ {}'.format(c)
r = smf.ols(formula=f, data=data).fit()
coef = pd.concat([r.params,
r.conf_int().iloc[:, 0],
r.conf_int().iloc[:, 1]], axis=1, keys=['coef', 'lower', 'upper'])
coef.index = ['Intercept', 'Beta']
results[c] = dict(coef=coef, rsq=r.rsquared)
keys = data.columns[1:]
summary = pd.concat([results[k]['coef'].stack() for k in keys], axis=1, keys=keys)
summary.index = summary.index.to_series().str.join(' - ')
summary.append(pd.Series([results[k]['rsq'] for k in keys], keys, name='R Squared'))