Dask dataframe from zipped CSVs via delayed

Date: 2018-10-19 03:07:38

Tags: pandas dask zipfile dask-delayed

I am trying to create a Dask dataframe from a set of zipped CSV files. Reading up on the problem, it seems that Dask needs to use dask.delayed().

import glob
import dask.dataframe as dd
import zipfile
import pandas as pd 
from dask.delayed import delayed

#Create zip_dict with key-value pairs for .zip & .csv names
file_list = glob.glob('my_directory/zip_files/*.zip')
zip_dict = {}
for f in file_list:
    key = f.split('/')[5][:-4]  # file name without the .zip extension (index depends on path depth)
    zip_dict[key] = zipfile.ZipFile(f)
  

Sample contents of zip_dict = {'log20160201': <zipfile.ZipFile filename='/my_directory/zip_files/log20160201.zip' mode='r'>, 'log20160218': <zipfile.ZipFile filename='/my_directory/zip_files/log20160218.zip' mode='r'>}

# Create list of delayed pd.read_csv()    
d_rows = []
for k, v in zip_dict.items():

    row = delayed(pd.read_csv)(v.open(k + '.csv'), usecols=['time', 'cik'])
    d_rows.append(row)
    v.close()
  

Sample contents of d_rows = [Delayed('read_csv-c05dc861-79c3-4e22-8da6-927f5b7da123'), Delayed('read_csv-4fe1c901-44b4-478b-9c11-4a80f7a639e2')]

big_df = dd.from_delayed(d_rows)  

The error returned is: ValueError: Invalid file path or buffer object type: <class 'list'>

1 answer:

Answer 0 (score: 0):

In this case, I don't think you actually need the dictionary zip_dict in order to lazily read these zipped files with Pandas.

Based on this very similar SO question to read in (.gz) compressed *.csv files using Dask (also shown here), one possible approach you could take is to:

A. Lazily read the files with Pandas and dask.delayed (making sure to specify the names of the columns you want to keep) and create a list of delayed objects

B. Convert into a single Dask dataframe with dd.from_delayed, while specifying the column dtypes (as recommended); only the dtypes for the 2 columns you need have to be specified

import glob
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
from collections import OrderedDict

file_list = glob.glob('my_directory/zip_files/*.zip')

# Lazily reading files into Pandas DataFrames
dfs = [delayed(pd.read_csv)(f, compression='zip', usecols=['time', 'cik'])
       for f in file_list]

# Specify column dtypes for columns in Dask DataFrame (recommended)
my_dtypes = OrderedDict([("time", int), ("cik", int)])

# Combine into a single Dask DataFrame
ddf = dd.from_delayed(dfs, meta=my_dtypes)

print(type(ddf))
<class 'dask.dataframe.core.DataFrame'>
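
One caveat: pd.read_csv(..., compression='zip') only works when each archive contains exactly one file. If some of your zips hold more than one member, a variant of your original zipfile approach still works, provided the archive is opened inside the delayed function so that nothing is closed before Dask actually computes. Below is a minimal sketch under that assumption (and assuming, as in your zip_dict example, that each archive holds a CSV named after the zip; read_zipped_csv is just an illustrative helper):

import os
import glob
import zipfile
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def read_zipped_csv(zip_path):
    # Runs at compute time, so the archive is opened (and closed) only when needed
    csv_name = os.path.basename(zip_path)[:-4] + '.csv'   # e.g. log20160201.csv
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(csv_name) as f:
            return pd.read_csv(f, usecols=['time', 'cik'])

file_list = glob.glob('my_directory/zip_files/*.zip')
dfs = [delayed(read_zipped_csv)(f) for f in file_list]
ddf = dd.from_delayed(dfs, meta={'time': int, 'cik': int})

ddf.head()  # nothing is read from disk until a call like this triggers computation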