How can I better format the output that I'm trying to save from several regressions?

Time: 2016-07-11 21:46:00

Tags: python pandas dictionary formatting

I'd like to loop through several specifications of a linear regression and save the results for each model in a Python dictionary. The code below partly works, but each cell ends up holding a whole pandas object, so extra text (e.g. dtype information) appears in the resulting table and makes it unreadable. I'd also like the confidence interval split into two separate columns, one for the lower bound and one for the upper bound, but I haven't been able to do that.

code:

import pandas as pd
import patsy
import statsmodels.api as sm
from collections import defaultdict

colleges = ['ARC_g',u'CCSF_g',u'DAC_g',u'DVC_g',u'LC_g',u'NVC_g',u'SAC_g', u'SRJC_g',u'SC_g',u'SCC_g']
results = defaultdict(lambda: defaultdict(int))

for exog in colleges:
    exog = exog.encode('ascii')
    f1 = 'GRADE_PT_103 ~ %s -1' % exog
    y,X = patsy.dmatrices(f1, data,return_type='dataframe')
    mod = sm.OLS(y, X)    # Describe model

    res = mod.fit()       # Fit model

    results[exog]['beta'] = res.params
    # I'd like the confidence interval to be separated into two columns ('upper' and 'lower')
    results[exog]['CI'] = res.conf_int()
    results[exog]['rsq'] = res.rsquared

pd.DataFrame(results)

Current output:

     | ARC_g                         | CCSF_g                         | ...
beta | ARC_g 0.79304 dtype: float64  | CCSF_g 0.833644 dtype: float64
CI   | 0 1 ARC_g 0.557422 1.0... 0 1 | CCSF_g 0.655746 1...
rsq  | 0.122551                      | 0.213053

1 answer:

Answer 0 (score: 2)

This is how I would summarize what you've shown. Hopefully it gives you some ideas.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame(np.random.randn(30, 5), columns=list('YABCD'))

results = {}
for c in data.columns[1:]:
    f = 'Y ~ {}'.format(c)
    r = smf.ols(formula=f, data=data).fit()
    coef = pd.concat([r.params,
                      r.conf_int().iloc[:, 0],
                      r.conf_int().iloc[:, 1]], axis=1, keys=['coef', 'lower', 'upper'])
    coef.index = ['Intercept', 'Beta']
    results[c] = dict(coef=coef, rsq=r.rsquared)


keys = data.columns[1:]
summary = pd.concat([results[k]['coef'].stack() for k in keys], axis=1, keys=keys)
summary.index = summary.index.to_series().str.join(' - ')
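# DataFrame.append was removed in pandas 2.0 (and it returned a new frame
# rather than changing `summary`), so add the R-squared row by label instead.
summary.loc['R Squared'] = [results[k]['rsq'] for k in keys]
summary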
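
If you would rather have one row per model with 'lower' and 'upper' as separate columns, as the question asks, a minimal sketch along the same lines might look like this. The random data is only a stand-in for your `data`, the names `rows` and `tidy` are made up for illustration, and the `- 1` (no intercept) mirrors the formula in the question. The key point is to store scalars rather than whole Series/DataFrame objects in the dictionary, which is what avoids the dtype text in the cells.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption; the question's `data` is not shown).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((30, 5)), columns=list('YABCD'))

rows = {}
for c in data.columns[1:]:
    # "- 1" drops the intercept, as in the question's formula.
    r = smf.ols('Y ~ {} - 1'.format(c), data=data).fit()
    ci = r.conf_int()  # DataFrame: column 0 = lower bound, column 1 = upper bound
    # Pull out plain numbers so each cell holds a scalar, not a pandas object.
    rows[c] = {
        'beta': r.params[c],
        'lower': ci.loc[c, 0],
        'upper': ci.loc[c, 1],
        'rsq': r.rsquared,
    }

# One row per model, one column per statistic.
tidy = pd.DataFrame.from_dict(rows, orient='index')
print(tidy)
```

With your own data you would loop over `colleges` instead of `data.columns[1:]` and use `GRADE_PT_103` as the response.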
