Operations in dask hang after loading multiple CSV files

Asked: 2019-12-12 14:54:40

Tags: python csv dataframe dask

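I am trying to get started with dask to handle large datasets in some ML projects. Loading a single CSV file into a dask dataframe works fine. When I try to use multiple CSV files, any operation such as "compute" causes the program to hang indefinitely.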

This works:

import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client

client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//4120109.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()

Output: (3600, 3723)

The following code hangs indefinitely:

import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client

client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//412010*.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()

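It should load the 10 matching files and produce a shape of (36000, 3273). After adding a few print statements, I know it hangs specifically on the shape.compute() line. Any help would be greatly appreciated!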

1 answer:

Answer 0 (score: 0)

Don't mix dask.delayed and dask.dataframe. You probably just want to call dask.compute(df.shape) instead of wrapping df.shape in dask.delayed(print).

https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections