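I have about 2 million records in my data frame; a sample of the data is shown below. I am trying to fit a linear model with the code below, but it takes forever. Does anyone have tips for speeding it up? Even when I take a sample of only 20K records, fitting the model takes forever on my machine. My end goal is to see whether there is any association between the categorical variables and loan_amount, so I am fitting a linear model and running an ANOVA. If anyone has an alternative suggestion that handles large data better, I would be happy with that too.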
Code:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Treat these columns as categorical predictors
for col in ['buyer', 'seller', 'property_type', 'lender', 'year_built']:
    sample_df[col] = sample_df[col].astype('category')

# Fit loan_amount against a single categorical predictor
mod = ols('loan_amount ~ buyer', data=sample_df).fit()
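One alternative that may avoid the slow fit: for a single categorical predictor, the F statistic in the ANOVA table of `loan_amount ~ buyer` is the classical one-way ANOVA statistic, which can be computed from per-group counts and means without building a dummy-variable matrix. The following is a minimal sketch under that assumption. `one_way_anova` and its argument names are hypothetical, not part of statsmodels, and rows with a missing factor or response are dropped.

```python
import pandas as pd


def one_way_anova(df, factor, response):
    """One-way ANOVA F statistic from per-group aggregates.

    Hypothetical helper: returns (f_stat, df_between, df_within).
    """
    # Drop rows where either the factor or the response is missing
    data = df[[factor, response]].dropna()
    y = data[response].astype(float)

    # observed=True ignores unused levels of a categorical column
    groups = y.groupby(data[factor], observed=True)
    counts = groups.count()
    means = groups.mean()
    grand_mean = y.mean()

    # Between-group and within-group sums of squares
    ss_between = (counts * (means - grand_mean) ** 2).sum()
    ss_total = ((y - grand_mean) ** 2).sum()
    ss_within = ss_total - ss_between

    df_between = len(counts) - 1
    df_within = len(y) - len(counts)

    # The F statistic is undefined with one group or one row per group
    if df_between <= 0 or df_within <= 0:
        return float("nan"), df_between, df_within

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return float(f_stat), df_between, df_within
```

It would be called as `one_way_anova(sample_df, 'buyer', 'loan_amount')`. If SciPy is installed, a p-value can be obtained with `scipy.stats.f.sf(f_stat, df_between, df_within)`.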
Data:
print(sample_df.iloc[1:10])
                 id  property_address  \
1452006   577982716  10407 HILLGATE AVE BAKERSFIELD CA 93311
1368833   579792639  717 SANTA ISABEL DR SAN DIEGO CA 92114
1276131   577296028  NaN
3095127   585549018  2300 GREENBRIAR DR #C CHULA VISTA CA 91915
1032186   574454847  NaN
12094387  570361895  1253 CHAMBERLIN CT CAMPBELL CA 95008
8622975   567312492  9424 TWIN TRAILS DR #204 SAN DIEGO CA 92129
1214389   577538408  651 ASPEN MEADOWS WAY LINCOLN CA 95648
3155561   587054191  950 SHORE POINT CT #206 ALAMEDA CA 94501
          buyer                              seller  \
1452006   BLAYLOCK,ANDREW L|WINTON,SARA A    LENNAR HMS OF CA INC
1368833   VELAZQUEZ,VICTOR A|ODOM,SHANDEL G  NaN
1276131   ASTON HOLDINGS LLC                 MOTE,JOHN C
3095127   KEOUGH,JOSEPH & SANDRA             KEOUGH,JOSEPH
1032186   HOMEQUEST INVESTMENTS INC          ANDERSEN MONICA & K TRUST
12094387  HALLOCK,KAREN A LIVING TRUST       HALLOCK,KAREN A
8622975   MATAAFA,VALE P & CHRISTINE L       NaN
1214389   MONROE,COBY & DESIRAE              NaN
3155561   JACOBSEN,JESSICA L                 WONG JENNIE TRUST
          transaction_date  property_id  property_type  transaction_amount  \
1452006         2016-08-08    182232030           RSFR              281000
1368833         2016-09-27     26293661           RSFR                   0
1276131         2016-07-06     30827189           VRES                   0
3095127         2016-12-16     26352345           RCON                   0
1032186         2016-06-23     29966416           RSFR              300000
12094387        2016-04-06    104705632           RSFR                   0
8622975         2016-03-15     25941782           RCON                   0
1214389         2016-08-18     97638435           RSFR                   0
3155561         2016-12-16     38308490           RCON              370000
          loan_amount  lender                     sqft  year_built  trans_yr
1452006        275664  UNIVERSAL AMERICAN MTG CO  2105           0      2016
1368833        285000  EQUIFUND MTG               1115        1950      2016
1276131        450000  AMERICAN AGCREDIT FLCA        0           0      2016
3095127        225000  FINANCE OF AMERICA MTG     1600        1996      2016
1032186             0  NaN                         780        1929      2016
12094387            0  NaN                        3544        2004      2016
8622975        326947  PLAZA HM MTG               1127        1986      2016
1214389        356000  FINANCE OF AMERICA MTG     3023        2002      2016
3155561        333000  USAA                        770        1972      2016
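In the sample above almost every row has a distinct buyer, so a formula such as `loan_amount ~ buyer` asks for roughly one dummy column per distinct buyer, which is likely a large part of why the fit is slow. A quick check is `sample_df['buyer'].nunique()` compared with `len(sample_df)`. If the number of levels is close to the number of rows, one option is to lump rarely seen levels into a single label before modelling. The sketch below assumes that approach; `lump_rare_levels`, `min_count`, and `other_label` are hypothetical names, and the threshold of 50 is arbitrary.

```python
import pandas as pd


def lump_rare_levels(series, min_count=50, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with one label.

    Hypothetical helper; missing values are left as missing.
    """
    s = series.astype(object)
    counts = s.value_counts()  # NaN is excluded from the counts by default
    rare = counts.index[counts < min_count]
    # mask() swaps in other_label wherever the value is a rare level
    return s.mask(s.isin(rare), other_label).astype("category")
```

For example, `sample_df['lender'] = lump_rare_levels(sample_df['lender'])` would keep only the frequently seen lenders as separate levels. Whether lumping is appropriate depends on the question being asked, so treat this as one possible preprocessing step rather than a recommendation.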