Dask workers running out of memory when writing parquet files

Asked: 2018-08-10 08:42:25

Tags: python parquet dask

I am trying to convert a large number of big .csv files to parquet format using Python and Dask.

Here is the code I am using:

import os
from timeit import default_timer

import dask.dataframe as dd

# Read every semicolon-separated .TXT file; col_types is a dtype mapping
# defined earlier in the notebook.
trans = dd.read_csv(os.path.join(TRANS_PATH, "*.TXT"),
                    sep=";", dtype=col_types, parse_dates=['salesdate'])

trans = trans.drop('salestime', axis=1)
trans['month_year'] = trans['salesdate'].dt.strftime('M_%Y_%m')

# Build a join key and pull the category code in from the attribs table.
trans['chainid'] = '41'
trans['key'] = trans['chainid'] + trans['barcode']
trans = trans.join(attribs[['catcode']], on=['key'])

start = default_timer()
trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
                 partition_on=['catcode', 'month_year'], append=False)
end = default_timer()
print("Done in {} secs.".format(end - start))

My code seems to run well, and all of the parquet files are created under the correct directories with hardly any warnings, right up until the end. Execution of the graph gets to this point: (screenshot of the execution progress)

At this point, the program gets stuck for a minute or so, and then the following warnings appear:

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 16996 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
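From what I can tell, the 95% figure in the warning is distributed's terminate threshold, which the nanny enforces per worker process. A sketch of how those limits could be adjusted (the values are illustrative, and I have not verified that this helps):

import dask
from dask.distributed import Client, LocalCluster

# Illustrative: give each worker an explicit memory budget and tune the
# fractions at which distributed spills to disk, pauses, and finally
# kills a worker.
dask.config.set({
    'distributed.worker.memory.target': 0.6,      # start spilling to disk
    'distributed.worker.memory.spill': 0.7,
    'distributed.worker.memory.pause': 0.8,
    'distributed.worker.memory.terminate': 0.95,  # the 95% budget in the warning
})

cluster = LocalCluster(n_workers=4, memory_limit='8GB')  # hypothetical sizes
client = Client(cluster)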

After the warnings, hundreds of processes are restarted and executed again: (screenshot of the execution progress)

This happens three times, and then the program finally crashes:

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-8-f644e4fa53ea> in <module>()
     27 start = default_timer()
     28 trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
---> 29                      partition_on=['catcode', 'month_year'], append=False)
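For what it is worth, my understanding is that the scheduler raises KilledWorker once a task has been involved in a fixed number of worker deaths, three by default, which would match the three restart cycles above. A sketch of how that retry limit could be raised (illustrative, untested):

import dask

# Illustrative: let a task survive more worker deaths before the
# scheduler gives up and raises KilledWorker (the default is 3).
dask.config.set({'distributed.scheduler.allowed-failures': 10})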

Does anyone know why this is happening?

0 Answers
