I am trying to use Python and Dask to convert a large number of big .csv files to Parquet format.
Here is the code I am using:
import os
from timeit import default_timer

import dask.dataframe as dd

trans = dd.read_csv(os.path.join(TRANS_PATH, "*.TXT"),
                    sep=";", dtype=col_types, parse_dates=['salesdate'])
trans = trans.drop('salestime', axis=1)
trans['month_year'] = trans['salesdate'].dt.strftime('M_%Y_%m')
trans['chainid'] = '41'
trans['key'] = trans['chainid'] + trans['barcode']
trans = trans.join(attribs[['catcode']], on=['key'])

start = default_timer()
trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
                 partition_on=['catcode', 'month_year'], append=False)
end = default_timer()
print("Done in {} secs.".format(end - start))
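For reference, the per-row column derivations in the snippet (the `month_year` bucket and the `chainid` + `barcode` join key) amount to the following in plain Python; the `salesdate` and `barcode` values here are made-up examples, not data from my files:

```python
from datetime import date

# Hypothetical example values standing in for one row of the CSV data.
salesdate = date(2017, 3, 15)
barcode = "0001234567890"
chainid = "41"

# Equivalent of: trans['month_year'] = trans['salesdate'].dt.strftime('M_%Y_%m')
month_year = salesdate.strftime("M_%Y_%m")  # 'M_2017_03'

# Equivalent of: trans['key'] = trans['chainid'] + trans['barcode']
key = chainid + barcode  # '410001234567890'

print(month_year, key)
```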
The code seems to run fine, and all the Parquet files are created in the correct directories with almost no warnings, until the very end. The task graph executes up to this point:
At that point the program hangs for a minute or so, and then the following warnings appear:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 16996 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
This happened three times, and then the program finally crashed:
---------------------------------------------------------------------------
KilledWorker Traceback (most recent call last)
<ipython-input-8-f644e4fa53ea> in <module>()
27 start = default_timer()
28 trans.to_parquet(PARQUET_PATH, engine="fastparquet", compression='snappy',
---> 29 partition_on=['catcode', 'month_year'], append=False)
Does anyone know why this is happening?