Fitting a linear model on a very large dataset

Time: 2018-02-08 00:35:46

Tags: python pandas statsmodels

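My data frame has about 2 million records; a sample of the data is included below. I am trying to fit a linear model with the code shown below, but it takes forever. Does anyone have tips for speeding it up? Even sampling just 20K records and fitting the model takes forever on my machine. My end goal is to see whether there is any association between the categorical variables and loan_amount, so I am fitting a linear model and running an ANOVA. If anyone has an alternative approach that handles large data better, I would welcome that too.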

Code:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Treat the grouping columns as categorical variables
categorical_cols = ['buyer', 'seller', 'property_type', 'lender', 'year_built']
for col in categorical_cols:
    sample_df[col] = sample_df[col].astype('category')

# Fit loan_amount against a single categorical predictor
mod = ols('loan_amount ~ buyer', data=sample_df).fit()
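
A possible reason for the slow fit (an assumption based on the sample data, not something measured here) is that `buyer` has nearly as many distinct levels as rows, so the formula expands it into one dummy column per level. For a single categorical predictor, the one-way ANOVA F-test can be computed from group means alone, without building that design matrix. The sketch below does this with a pandas groupby; the function name `one_way_anova` and its return keys are illustrative and not part of statsmodels.

```python
import pandas as pd
from scipy import stats


def one_way_anova(df, factor, response):
    """One-way ANOVA F-test of `response` across the levels of `factor`.

    Uses per-group means instead of a dummy-variable design matrix.
    Rows with a missing factor or response are dropped (an assumption
    about how missing values should be handled).
    """
    data = df[[factor, response]].dropna()
    y = data[response].astype(float)
    keys = data[factor]

    n = len(y)            # number of usable rows
    k = keys.nunique()    # number of observed levels
    if k < 2 or n <= k:
        raise ValueError("need at least 2 levels and more rows than levels")

    # Mean of each row's own group, aligned with y
    group_means = y.groupby(keys, observed=True).transform("mean")
    grand_mean = y.mean()

    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_within = ((y - group_means) ** 2).sum()

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return {"F": f_stat, "p_value": p_value,
            "df_between": df_between, "df_within": df_within}
```

With the data frame from the question this would be called as `one_way_anova(sample_df, 'lender', 'loan_amount')`. If a column has as many levels as rows, which may be the case for `buyer`, no F-test is defined and the function raises a `ValueError`.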

Data:

print(sample_df.iloc[1:10])

                id                             property_address  \
1452006   577982716      10407 HILLGATE AVE BAKERSFIELD CA 93311   
1368833   579792639       717 SANTA ISABEL DR SAN DIEGO CA 92114   
1276131   577296028                                          NaN   
3095127   585549018   2300 GREENBRIAR DR #C CHULA VISTA CA 91915   
1032186   574454847                                          NaN   
12094387  570361895         1253 CHAMBERLIN CT CAMPBELL CA 95008   
8622975   567312492  9424 TWIN TRAILS DR #204 SAN DIEGO CA 92129   
1214389   577538408       651 ASPEN MEADOWS WAY LINCOLN CA 95648   
3155561   587054191     950 SHORE POINT CT #206 ALAMEDA CA 94501   

                                      buyer                     seller  \
1452006     BLAYLOCK,ANDREW L|WINTON,SARA A       LENNAR HMS OF CA INC   
1368833   VELAZQUEZ,VICTOR A|ODOM,SHANDEL G                        NaN   
1276131                  ASTON HOLDINGS LLC                MOTE,JOHN C   
3095127              KEOUGH,JOSEPH & SANDRA              KEOUGH,JOSEPH   
1032186           HOMEQUEST INVESTMENTS INC  ANDERSEN MONICA & K TRUST   
12094387       HALLOCK,KAREN A LIVING TRUST            HALLOCK,KAREN A   
8622975        MATAAFA,VALE P & CHRISTINE L                        NaN   
1214389               MONROE,COBY & DESIRAE                        NaN   
3155561                  JACOBSEN,JESSICA L          WONG JENNIE TRUST   

         transaction_date  property_id property_type  transaction_amount  \
1452006        2016-08-08    182232030          RSFR              281000   
1368833        2016-09-27     26293661          RSFR                   0   
1276131        2016-07-06     30827189          VRES                   0   
3095127        2016-12-16     26352345          RCON                   0   
1032186        2016-06-23     29966416          RSFR              300000   
12094387       2016-04-06    104705632          RSFR                   0   
8622975        2016-03-15     25941782          RCON                   0   
1214389        2016-08-18     97638435          RSFR                   0   
3155561        2016-12-16     38308490          RCON              370000   

          loan_amount                     lender  sqft  year_built  trans_yr  
1452006        275664  UNIVERSAL AMERICAN MTG CO  2105           0      2016  
1368833        285000               EQUIFUND MTG  1115        1950      2016  
1276131        450000     AMERICAN AGCREDIT FLCA     0           0      2016  
3095127        225000     FINANCE OF AMERICA MTG  1600        1996      2016  
1032186             0                        NaN   780        1929      2016  
12094387            0                        NaN  3544        2004      2016  
8622975        326947               PLAZA HM MTG  1127        1986      2016  
1214389        356000     FINANCE OF AMERICA MTG  3023        2002      2016  
3155561        333000                       USAA   770        1972      2016 
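
In the rows shown, columns such as `buyer` and `seller` look close to unique per record, while `property_type` repeats. Before fitting, it may help to count the levels in each categorical column, since each level becomes a dummy column in the formula-based fit. A minimal sketch, where `summarize_levels` is a hypothetical helper name:

```python
import pandas as pd


def summarize_levels(df, columns):
    """Count non-null rows and distinct levels for each listed column."""
    non_null = df[columns].notna().sum()
    levels = df[columns].nunique()  # missing values are not counted as a level
    return pd.DataFrame({
        "non_null_rows": non_null,
        "levels": levels,
        # Average number of rows available per level
        "rows_per_level": non_null / levels,
    })
```

Called as `summarize_levels(sample_df, ['buyer', 'seller', 'property_type', 'lender', 'year_built'])`, a `rows_per_level` value near 1 would suggest the column is closer to an identifier than a usable grouping factor.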

0 answers:

No answers