Iterating OLS models with Python pandas and statsmodels runs very slowly? (Possibly improper use of DataFrames!)

Time: 2015-03-30 06:47:52

Tags: python pandas regression statsmodels

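I am using statsmodels and pandas to automate running linear regressions over many combinations of variables. In total there are 697,343 combinations. That is a lot of OLS fits, but I would not expect it to take this long (more than 1 hour). X can be as large as 18x18, and Y is always 18x1.

Can anyone tell me whether the code I am using is unoptimized, and perhaps suggest a solution?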

import time
import pandas
import statsmodels.api as sm
perm = pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/perm")
BB = pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/BB")
wdb_demog=pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_demog")
wdb_hts=pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_hts")

result_db= pandas.DataFrame(columns=('R-squared value','Adj. R-squared','F-statistic','Prob (F-statistic)','coefficeints','Variables'))
row=-1
for v in range(len(perm)):
    row+=1
    variables_columns=list(set(perm.loc[v]))
    if None in variables_columns:
        variables_columns.remove(None)   
    X= pandas.DataFrame(BB[variables_columns]).values.tolist()
    Y= pandas.DataFrame(BB[wdb_hts.columns.values[1]]).values.tolist()    
    model = sm.OLS(Y,X)
    results = model.fit()
    R=[round(results.rsquared,4),
       round(results.rsquared_adj,4),
       round(results.fvalue,4),
       round(results.f_pvalue,4),
       list(results.params),
       list(variables_columns)] 
    result_db.loc[row]= pandas.Series(R, index=result_db.columns)

result_db.to_pickle("C:/SharedData/Temp/ResultTestDataframes/TEST")
print("done! " + time.strftime("%c"))

--------------------

# BB is a DataFrame (18 rows × 90 columns).
# perm is a DataFrame (697343 × 17) that holds all the combinations of variables. X (the exogenous variables) is built from a given combination of variables and the data in the BB DataFrame.
# wdb_hts is another DataFrame, used to read the variable name from which Y (the endogenous variable) is constructed.

0 answers:

No answers