I am trying to create a Dask DataFrame from a set of zipped CSV files. Reading up on the problem, it seems Dask requires the use of dask.delayed().
import glob
import dask.dataframe as dd
import zipfile
import pandas as pd
from dask.delayed import delayed
#Create zip_dict with key-value pairs for .zip & .csv names
file_list = glob.glob('my_directory/zip_files/*.zip')
zip_dict = {}
for f in file_list:
    key = f.split('/')[5][:-4]
    zip_dict[key] = zipfile.ZipFile(f)
Sample content of zip_dict = {'log20160201': <zipfile.ZipFile filename='/my_directory/zip_files/log20160201.zip' mode='r'>, 'log20160218': <zipfile.ZipFile filename='/my_directory/zip_files/log20160218.zip' mode='r'>}
# Create list of delayed pd.read_csv()
d_rows = []
for k, v in zip_dict.items():
    row = delayed(pd.read_csv)(v.open(k+'.csv'), usecols=['time','cik'])
    d_rows.append(row)
    v.close()
Sample content of d_rows = [Delayed('read_csv-c05dc861-79c3-4e22-8da6-927f5b7da123'), Delayed('read_csv-4fe1c901-44b4-478b-9c11-4a80f7a639e2')]
big_df = dd.from_delayed(d_rows)
The error returned is: ValueError: Invalid file path or buffer object type: <class 'list'>
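As an aside, one fragility worth noting in the loop above: delayed defers pd.read_csv until compute time, so calling v.close() inside the loop closes each archive before any read actually happens. A minimal sketch of one way around this, opening the archive inside the delayed task instead (the read_member helper and the throwaway file names are my own, not from the original post):

```python
import os
import tempfile
import zipfile

import pandas as pd
from dask.delayed import delayed

def read_member(zip_path, member, usecols):
    # Open the archive at compute time, inside the delayed task,
    # so no handle has to stay open between building and computing the graph.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return pd.read_csv(fh, usecols=usecols)

# Demo with a throwaway archive (a stand-in for log20160201.zip):
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, 'log20160201.csv')
pd.DataFrame({'time': [1, 2], 'cik': [10, 20]}).to_csv(csv_path, index=False)
zip_path = os.path.join(tmp, 'log20160201.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.write(csv_path, arcname='log20160201.csv')

task = delayed(read_member)(zip_path, 'log20160201.csv', ['time', 'cik'])
df = task.compute()
print(df.shape)  # (2, 2)
```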
Answer (score: 0):
In this case, I don't think you actually need the dictionary zip_dict to lazily read these zipped files with Pandas. Based on this very similar SO question about reading in (.gz) compressed *.csv files using Dask (also shown here), one possible approach is to:

A. Lazily read the files using Pandas and dask.delayed (making sure to specify the names of the columns you want to keep), creating a list of delayed objects

B. Convert to a single Dask DataFrame using dd.from_delayed, while specifying the column dtypes (as recommended) - only the dtypes of the 2 columns you need are required
import glob
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
from collections import OrderedDict
file_list = glob.glob('my_directory/zip_files/*.zip')
# Lazily reading files into Pandas DataFrames
dfs = [delayed(pd.read_csv)(f, compression='zip', usecols=['time','cik'])
       for f in file_list]
# Specify column dtypes for columns in Dask DataFrame (recommended)
my_dtypes = OrderedDict([("time",int), ("cik",int)])
# Combine into a single Dask DataFrame
ddf = dd.from_delayed(dfs, meta=my_dtypes)
print(type(ddf))
<class 'dask.dataframe.core.DataFrame'>