将数据框dask到镶木地板失败,出现内存错误

时间:2017-12-11 01:02:36

标签: dask dask-distributed dask-delayed

我从多个hdfs文件创建了dask数据帧,然后尝试将最终的数据帧写回hdfs(parquet)。但它因内存错误消息而失败。

dask_df=<some initialization>
for parquet_hdfs_path in hdfs_files:
    df=dd.read_parquet(parquet_hdfs_path)
    dask_df = dask_df.join(df)

dask_df.to_parquet(...)

最终的dask数据框将有100万条记录和600列。

Dask群集大小:5个节点,每个节点有55G内存。

例外:

File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/dataframe/core.py", line 957, in to_parquet
[ worker demo4-dn-1 ] :     return to_parquet(path, self, *args, **kwargs)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 445, in to_parquet
[ worker demo4-dn-1 ] :     writes = delayed(writes).compute()
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/base.py", line 99, in compute
[ worker demo4-dn-1 ] :     (result,) = compute(self, traverse=False, **kwargs)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/base.py", line 206, in compute
[ worker demo4-dn-1 ] :     results = get(dsk, keys, **kwargs)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/distributed/client.py", line 1934, in get
[ worker demo4-dn-1 ] :     results = self.gather(packed, asynchronous=asynchronous)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/distributed/client.py", line 1376, in gather
[ worker demo4-dn-1 ] :     asynchronous=asynchronous)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/distributed/client.py", line 548, in sync
[ worker demo4-dn-1 ] :     return sync(self.loop, func, *args, **kwargs)
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/distributed/utils.py", line 241, in sync
[ worker demo4-dn-1 ] :     six.reraise(*error[0])
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/six.py", line 693, in reraise
[ worker demo4-dn-1 ] :     raise value
[ worker demo4-dn-1 ] :   File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/distributed/utils.py", line 229, in f
[ worker demo4-dn-1 ] :     result[0] = yield make_coro()

请指出我正确的方向来克服这个问题。

0 个答案:

没有答案