To better understand the dask library in Python, I am trying to make a fair comparison between using dask and not using it. I created a big dataset with h5py, which I later use for a numpy-style operation: measuring the mean along one axis.
I would like to know whether what I have done really allows a fair comparison, to check whether dask can run the code in parallel. I have been reading the h5py and dask documentation, and this is the small experiment I came up with.
What I have done so far is:
1. Create (write) a dataset with h5py. This uses the maxshape and resize options to append data, so the whole dataset is never loaded into memory at once, avoiding memory problems.
2. Estimate a simple operation along one axis (the mean) with "classical code", meaning the mean is estimated every 1000 rows.
3. Repeat the previous step, but this time with dask.
So for the first step, this is what I have so far:
# Write h5 dataset
import time
import h5py
import numpy as np

chunks = (100, 500, 2)
tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
    # create a 3D dataset inside one h5py file
    dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                            maxshape=(None, 5000, 2), compression='gzip')  # to append data on axis 0
    print(dset.shape)
    while dset.shape[0] < 4*10**7:  # append data until axis 0 = 4*10**7 (note: compare shape[0], not the shape tuple)
        dset.resize(dset.shape[0] + 10**4, axis=0)  # resize data
        print(dset.shape)  # check new shape for each append
        dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))
tmp = time.time() - tp
print('Writing time: {}'.format(tmp))
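As a side note, the resize-and-append pattern above can be tried quickly without touching disk by using h5py's in-memory driver (`driver='core'` with `backing_store=False`); the small shapes here are made up purely for illustration:

```python
import h5py
import numpy as np

# In-memory HDF5 file: nothing is written to disk (backing_store=False)
with h5py.File('demo.hdf5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('demo', (10, 5, 2), chunks=(5, 5, 2),
                            maxshape=(None, 5, 2))
    while dset.shape[0] < 50:  # grow axis 0 up to 50 rows
        dset.resize(dset.shape[0] + 10, axis=0)
        dset[-10:] = np.random.randint(2, size=(10, 5, 2))
    final_shape = dset.shape

print(final_shape)  # (50, 5, 2)
```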
For the second step, I read the previous dataset back and measured the time it takes to estimate the mean.
# Classical read
tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
with h5py.File(filename, mode='r') as f:
    # List all groups (actually there is only one)
    a_group_key = list(f.keys())[0]  # the only dataset in the h5 file
    # Get the data
    result = f.get(a_group_key)
    print(result.shape)
    #print(type(result))
    # read 1000 elements at a time
    start_ = 0  # initialize a start counter
    means = []
    while start_ < result.shape[0]:
        arr = np.array(result[start_:start_ + 1000])
        m = arr.mean()
        #print(m)
        means.append(m)
        start_ += 1000
    final_mean = np.array(means).mean()
    print(final_mean, len(means))  # len(means), not len(final_mean): a scalar has no len()
tmp = time.time() - tp
print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
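One thing to watch in the "classical" loop above: averaging the per-block means is only exact when every block has the same number of rows (here the row count is a multiple of 1000, so it works). If the last block were smaller, the block means would need to be weighted by block size. A minimal numpy sketch of the difference, with made-up sizes:

```python
import numpy as np

data = np.arange(25, dtype=float)  # 25 rows, block size 10 -> last block has only 5
block = 10
means, sizes = [], []
for start in range(0, len(data), block):
    chunk = data[start:start + block]
    means.append(chunk.mean())
    sizes.append(len(chunk))

naive = np.mean(means)                       # unweighted mean of block means: biased
weighted = np.average(means, weights=sizes)  # weighted by block length: exact

print(naive, weighted, data.mean())
```

Here `weighted` matches `data.mean()` (12.0), while the unweighted mean of means does not.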
For the third step, the dask way, I proceed as follows:
# Dask way
import time
import h5py
import dask.array as da
from dask import delayed

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
dset = h5py.File(filename, 'r')
dataset_names = list(dset.keys())[0]  # to obtain the dataset name
result = dset.get(dataset_names)
array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
print('Gigabytes of the input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
print('Mean array: {}'.format(x.compute()))
tmp = time.time() - tp
print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
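For what it's worth, my reading of the dask docs is that `da.from_array` already builds a lazy, chunked task graph, so wrapping the reduction in `delayed` should not be needed; calling `.compute()` on the dask array directly is the usual pattern. A minimal sketch on a small in-memory array (the shapes here are made up):

```python
import numpy as np
import dask.array as da

data = np.random.randint(2, size=(1000, 50, 2))

# from_array already yields a lazy, chunked dask array; no delayed() wrapper needed
arr = da.from_array(data, chunks=(100, 50, 2))
mean_axis0 = arr.mean(axis=0).compute()  # one compute() triggers the chunked reduction

print(mean_axis0.shape)  # (50, 2)
```

The result agrees with `data.mean(axis=0)` computed eagerly by numpy.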
I think I am missing some step, since the execution time seems to be longer than with the classical approach. Also, I suspect the chunks option may be the cause, since I used the same chunk size for both the h5 dataset and dask.
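On the chunks question: one rule of thumb from the dask documentation is to aim for chunks in the tens-to-hundreds of megabytes, and the (100, 500, 2) chunks used above are far smaller than that, which produces very many tiny tasks. The arithmetic (assuming 8-byte int64 elements, which is what `np.random.randint` produces by default on most platforms) is easy to check:

```python
# Bytes per chunk, assuming int64 (8 bytes per element)
small = 100 * 500 * 2 * 8     # the chunks used above
bigger = 1000 * 5000 * 2 * 8  # e.g. a hypothetical chunks=(1000, 5000, 2)

print(small / 1e6, 'MB')   # 0.8 MB
print(bigger / 1e6, 'MB')  # 80.0 MB
```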
Any suggestions about this procedure are welcome.