I have a large Pandas DataFrame loaded into memory (~9 GB). I am trying to write out a text file that follows a given format (Vowpal Wabbit), and I am puzzled by the memory usage and performance. Although the file is large (48 million rows), the initial load into Pandas is not bad. Writing out the file, however, takes at least 6+ hours and then nearly crushes my laptop, consuming almost all of my RAM (32 GB). Naively, I assumed this operation would work on one row at a time, so RAM usage would stay very small. Is there a more efficient way to process this data?
with open("C:\\Users\\Desktop\\DATA\\train_mobile2.vw", "wb") as outfile:
for index, row in train.iterrows():
if row['click'] ==0:
vwline=""
vwline+="-1 "
else:
vwline=""
vwline+="1 "
vwline+="|a C1_"+ str(row['C1']) +\
" |b banpos_"+ str(row['banner_pos']) +\
" |c siteid_"+ str(row['site_id']) +\
" sitedom_"+ str(row['site_domain']) +\
" sitecat_"+ str(row['site_category']) +\
" |d appid_"+ str(row['app_id']) +\
" app_domain_"+ str(row['app_domain']) +\
" app_cat_"+ str(row['app_category']) +\
" |e d_id_"+ str(row['device_id']) +\
" d_ip_"+ str(row['device_ip']) +\
" d_os_"+ str(row['device_os']) +\
" d_make_"+ str(row['device_make']) +\
" d_mod_"+ str(row['device_model']) +\
" d_type_"+ str(row['device_type']) +\
" d_conn_"+ str(row['device_conn_type']) +\
" d_geo_"+ str(row['device_geo_country']) +\
" |f num_a:"+ str(row['C17']) +\
" numb:"+ str(row['C18']) +\
" numc:"+ str(row['C19']) +\
" numd:"+ str(row['C20']) +\
" nume:"+ str(row['C22']) +\
" numf:"+ str(row['C24']) +\
" |g c21_"+ str(row['C21']) +\
" C23_"+ str(row['C23']) +\
" |h hh_"+ str(row['hh']) +\
" |i doe_"+ str(row['doe'])
outfile.write(vwline + "\n")
In response to a user's suggestion, I coded the following, but when it runs, the last line raises the error: unsupported operand type(s) for +: 'numpy.ndarray' and 'str'
lines_T = np.where(train['click'] == 0, "-1 ", "1 ") + \
          "|a C1_" + train['C1'].astype('str') + \
          " |b banpos_" + train['banner_pos'].astype('str') + \
          ....
          " |h hh_" + train['hh'].astype('str') + \
          " |i doe_" + train['doe'].astype('str')  # ERROR HERE

lines_T.to_csv("C:\\Users\\Desktop\\DATA\\KAGGLE\\mobile\\train_mobile.vw", mode='a', header=False, index=False)
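For reference, the error comes from the first term of the chain: np.where returns a plain numpy.ndarray, and ndarray + str does not broadcast string concatenation the way a pandas Series does. A minimal sketch of the usual fix, wrapping the result in pd.Series (the two-row DataFrame here is only an illustrative stand-in for the real train):

import numpy as np
import pandas as pd

# Toy stand-in for the real 48M-row frame (illustrative values only)
train = pd.DataFrame({'click': [0, 1], 'C1': [1005, 1002]})

labels = np.where(train['click'] == 0, "-1 ", "1 ")  # numpy.ndarray of strings
# labels + "|a C1_" raises: unsupported operand type(s) for +: ...

# Wrapping the ndarray in a Series (aligned to train's index) restores
# element-wise string concatenation:
lines = pd.Series(labels, index=train.index) + "|a C1_" + train['C1'].astype(str)
print(lines.tolist())  # ['-1 |a C1_1005', '1 |a C1_1002']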
Answer 0 (score: 1)
Not sure about the memory usage, but this will certainly be faster:
# Wrapping np.where's ndarray in a pd.Series keeps "+" working as
# element-wise string concatenation (a bare ndarray + str raises the
# TypeError reported above):
lines = pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index) + \
        "|a C1_" + train['C1'].astype('str') + \
        " |b banpos_" + train['banner_pos'].astype('str') + \
        ...
Then save the lines:
lines.to_csv(outfile, index=False, header=False)
If memory becomes an issue, you could also do this in batches of, say, a few million records at a time.
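A hedged sketch of that batching idea: build the VW lines a few million rows at a time and append each batch to the file, so only one chunk of concatenated strings lives in memory at once. The chunk size, the reuse of the question's output path, and the elided namespaces are assumptions to adapt:

import numpy as np
import pandas as pd

chunk_size = 2000000  # assumption: tune against available RAM
out_path = "C:\\Users\\Desktop\\DATA\\train_mobile2.vw"

with open(out_path, "w") as outfile:
    for start in range(0, len(train), chunk_size):
        chunk = train.iloc[start:start + chunk_size]
        lines = (
            pd.Series(np.where(chunk['click'] == 0, "-1 ", "1 "), index=chunk.index)
            + "|a C1_" + chunk['C1'].astype(str)
            + " |b banpos_" + chunk['banner_pos'].astype(str)
            # ... remaining namespaces exactly as in the question ...
            + " |i doe_" + chunk['doe'].astype(str)
        )
        # Join once per chunk and write, so all 48M lines never coexist in RAM
        outfile.write("\n".join(lines) + "\n")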