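I am using statsmodels and pandas to automate running linear regressions over various combinations of variables. In total there are 697,343 combinations. That is a lot of OLS fits, but I don't think it should take this long (more than 1 hour). X can be as large as 18×18, and Y is always 18×1.
Can someone tell me whether the code I am using is unoptimized, and possibly suggest a solution?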
import time
import pandas
import statsmodels.api as sm

# Raw strings keep the backslashes in the Windows paths literal.
perm = pandas.read_pickle(r'C:\SharedData\Temp\ResultTestDataframes\perm')
BB = pandas.read_pickle(r'C:\SharedData\Temp\ResultTestDataframes\BB')
wdb_demog = pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_demog")
wdb_hts = pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_hts")

result_db = pandas.DataFrame(columns=('R-squared value', 'Adj. R-squared', 'F-statistic', 'Prob (F-statistic)', 'coefficients', 'Variables'))
row = -1
for v in range(len(perm)):
    row += 1
    # Unique variable names for this combination, without the None padding.
    variables_columns = list(set(perm.loc[v]))
    if None in variables_columns:
        variables_columns.remove(None)
    X = pandas.DataFrame(BB[variables_columns]).values.tolist()
    Y = pandas.DataFrame(BB[wdb_hts.columns.values[1]]).values.tolist()
    model = sm.OLS(Y, X)
    results = model.fit()
    R = [round(results.rsquared, 4),
         round(results.rsquared_adj, 4),
         round(results.fvalue, 4),
         round(results.f_pvalue, 4),
         list(results.params),
         list(variables_columns)]
    # Append one result row per combination.
    result_db.loc[row] = pandas.Series(R, index=result_db.columns)

result_db.to_pickle("C:/SharedData/Temp/ResultTestDataframes/TEST")
print("done! " + time.strftime("%c"))
--------------------
# BB is a DataFrame (18 rows × 90 columns).
# perm is a DataFrame (697,343 rows × 17 columns) that holds every combination of variables. X (the exogenous variables) is built from the given combination of variable names and the data in BB.
# wdb_hts is another DataFrame, used to read the variable name from which Y (the endogenous variable) is constructed.
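
To make the setup above easier to reproduce and time without the pickled files, here is a minimal sketch on synthetic data with the same row count (18). It is not a confirmed fix, just one possible restructuring under stated assumptions: it assumes Python 3, assumes that assigning to `result_db.loc[row]` on every iteration accounts for a large share of the runtime (growing a DataFrame row by row is generally slow), and swaps `sm.OLS` for `numpy.linalg.lstsq`, so only the coefficients and the uncentered R-squared (the definition statsmodels documents for models without a constant) are computed. The helper names `fit_ols_uncentered` and `run_combinations`, the column names, and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_ols_uncentered(X, y):
    """Least-squares fit without an intercept; returns (params, uncentered R^2)."""
    params, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ params
    ssr = float(resid @ resid)
    tss = float(y @ y)  # uncentered total sum of squares, since X has no constant
    return params, 1.0 - ssr / tss


def run_combinations(data, y_col, combos):
    """Fit one regression per combination and build the result frame once."""
    y = data[y_col].to_numpy(dtype=float)  # extracted once, outside the loop
    rows = []
    for cols in combos:
        # Drop the None padding and duplicates while keeping column order.
        cols = [c for c in dict.fromkeys(cols) if c is not None]
        X = data[cols].to_numpy(dtype=float)
        params, r2 = fit_ols_uncentered(X, y)
        rows.append({
            "R-squared value": round(r2, 4),
            "coefficients": params.tolist(),
            "Variables": cols,
        })
    # A single DataFrame construction instead of one .loc assignment per row.
    return pd.DataFrame(rows)


# Synthetic stand-in for BB: 18 rows, a handful of hypothetical columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(18, 6)), columns=list("abcdef"))
data["y"] = 2.0 * data["a"] - data["b"]  # y is an exact combination of a and b

# Stand-in for a few rows of perm, padded with None.
combos = [("a", "b", None), ("a", "c", None), ("a", "b", "c")]
result = run_combinations(data, "y", combos)
print(result)
```

If the adjusted R-squared, F-statistic, and its p-value are also needed, `sm.OLS` can stay inside the loop; collecting each row in a list and building the DataFrame once at the end is independent of which fitting routine is used.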