Question

I am trying to open a csv.gzip file with Dask in Python. I will explain my code step by step.

First, I open the file with dask.dataframe.read_csv. In this step I specify the dtypes and combine 'Date[G]' and 'Time[G]' into a single column.

import dask.dataframe as dd

dtype_dict = {'#RIC': 'str', 'Price': 'float', 'Volume': 'float'}
df = dd.read_csv(f, compression='gzip', header=0, sep=',', quotechar='"',
                 usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'],
                 blocksize=None, parse_dates=[['Date[G]', 'Time[G]']],
                 dtype=dtype_dict)

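After that, I drop all rows with NA in the 'Price' and 'Volume' columns and set the combined column 'Date[G]_Time[G]' as the index, without dropping the column, because I still need it later.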

df = df.dropna(subset=['Price', 'Volume'])
df = df.set_index('Date[G]_Time[G]', drop=False)

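Then I try to split the 'Date[G]_Time[G]' column again, because my output file needs the date and the time in two separate columns. I know there must be a better way to do this; I just can't find it.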

df['Date[G]'] = dd.to_datetime(df['Date[G]_Time[G]']).dt.date
df['Time[G]'] = dd.to_datetime(df['Date[G]_Time[G]']).dt.time
df = df.drop(['Date[G]_Time[G]'], axis=1)

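After that, I append the dataframe to a list. I have a bunch of csv.gz files; I want to open all of them and then repartition the resulting large dataframe at calendar-year frequency.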

dl = []
dl.append(df)  # list.append returns None, so concatenate the list separately
df_concated = dd.concat(dl)
df_concated = df_concated.repartition(freq='A')

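I know dask can be really slow with its default settings. I just don't know how to configure it, and that is really frustrating. Does anyone know how to optimize my code?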

Sample data:

 #RIC   Date[G]    Time[G]         Price  Volume
 VZC.L  2014-05-01 06:16:00.480000 46.64  88.0
 VZC.L  2014-05-01 06:16:00.800000 46.64  33.0
 VZC.L  2014-05-01 06:16:00.890000 46.69  20.0
 VZC.L  2014-05-01 06:16:00.980000 46.69  40.0
 VZC.L  2014-05-01 06:16:01.330000 46.67  148.0

Answer 1

The problem may be the parse_dates argument of read_csv:

parse_dates=[['Date[G]', 'Time[G]']]

Try loading the file without parse_dates, so that these fields are read as object type, and then convert them to datetime.
See this answer.
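A minimal sketch of that suggestion, shown on a single pandas partition so it can be run without any files. The inline SAMPLE text, the helper name add_timestamp and the explicit format string are assumptions for illustration, not part of the original code. The date and time fields are read as plain strings and combined into one datetime column afterwards, so the separate 'Date[G]' and 'Time[G]' columns are still there for the output and do not need to be split again.

```python
import io

import pandas as pd

# Hypothetical stand-in for one decompressed csv.gz file (rows from the question,
# with one Volume left empty to exercise the dropna step).
SAMPLE = """#RIC,Date[G],Time[G],Price,Volume
VZC.L,2014-05-01,06:16:00.480000,46.64,88.0
VZC.L,2014-05-01,06:16:00.800000,46.64,33.0
VZC.L,2014-05-01,06:16:00.890000,46.69,
"""

# Read the date and time fields as strings instead of using parse_dates.
DTYPES = {"#RIC": "str", "Date[G]": "str", "Time[G]": "str",
          "Price": "float", "Volume": "float"}


def add_timestamp(part: pd.DataFrame) -> pd.DataFrame:
    """Drop rows lacking Price/Volume and add a combined datetime column."""
    part = part.dropna(subset=["Price", "Volume"])
    # An explicit format (assumed from the sample data) spares pandas from guessing it.
    stamp = pd.to_datetime(part["Date[G]"] + " " + part["Time[G]"],
                           format="%Y-%m-%d %H:%M:%S.%f")
    return part.assign(**{"Date[G]_Time[G]": stamp})


df = pd.read_csv(io.StringIO(SAMPLE), dtype=DTYPES)
out = add_timestamp(df)
print(out)
```

With dask, the same helper could presumably be applied per partition, e.g. `ddf.map_partitions(add_timestamp)` after a `dd.read_csv(...)` call that passes `dtype=DTYPES` and no `parse_dates`, followed by `set_index('Date[G]_Time[G]')` as in the question. Whether this is actually faster on the real files has not been measured here.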

How can I speed up my code using dask in Python?
