To better understand the dask library in Python, I am trying to make a fair comparison between using dask and not using it. I created a big dataset with h5py, which I later use for a numpy-style operation: measuring the mean along one axis.
I would like to know whether what I have done really allows a fair comparison, to check whether dask can run the code in parallel. I have been reading the h5py and dask documentation, and this is the small experiment I came up with.
What I have done so far is:
1. Create (write) a dataset with h5py. This uses the maxshape and resize options to append data, so the whole dataset is never loaded into memory at once, avoiding memory problems.
2. Estimate a simple operation along one axis (the mean) with "classical code", meaning the mean is estimated every 1000 rows.
3. Repeat the previous step, but this time with dask.
So for the first step, this is what I have so far:
# Write h5 dataset
import time
import h5py
import numpy as np

chunks = (100, 500, 2)
tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
    # create a 3D dataset inside one h5py file
    dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                            maxshape=(None, 5000, 2), compression='gzip')  # to append data on axis 0
    print(dset.shape)
    while dset.shape[0] < 4*10**7:  # append data until axis 0 = 4*10**7 (note: compare shape[0], not the shape tuple)
        dset.resize(dset.shape[0] + 10**4, axis=0)  # resize data
        print(dset.shape)  # check new shape for each append
        dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))
tmp = time.time() - tp
print('Writing time: {}'.format(tmp))
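As a side note, the resize-and-append pattern above can be tried quickly without touching disk by using h5py's in-memory driver (`driver='core'` with `backing_store=False`); the small shapes here are made up purely for illustration:

```python
import h5py
import numpy as np

# In-memory HDF5 file: nothing is written to disk (backing_store=False)
with h5py.File('demo.hdf5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('demo', (10, 5, 2), chunks=(5, 5, 2),
                            maxshape=(None, 5, 2))
    while dset.shape[0] < 50:  # grow axis 0 up to 50 rows
        dset.resize(dset.shape[0] + 10, axis=0)
        dset[-10:] = np.random.randint(2, size=(10, 5, 2))
    final_shape = dset.shape

print(final_shape)  # (50, 5, 2)
```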
For the second step, I read the previous dataset back and measured the time it takes to estimate the mean.
# Classical read
tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
with h5py.File(filename, mode='r') as f:
    # List all groups (actually there is only one)
    a_group_key = list(f.keys())[0]  # the only dataset in the h5 file
    # Get the data
    result = f.get(a_group_key)
    print(result.shape)
    #print(type(result))
    # read 1000 elements at a time
    start_ = 0  # initialize a start counter
    means = []
    while start_ < result.shape[0]:
        arr = np.array(result[start_:start_ + 1000])
        m = arr.mean()
        #print(m)
        means.append(m)
        start_ += 1000
    final_mean = np.array(means).mean()
    print(final_mean, len(means))  # len(means), not len(final_mean): a scalar has no len()
tmp = time.time() - tp
print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
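One thing to watch in the "classical" loop above: averaging the per-block means is only exact when every block has the same number of rows (here the row count is a multiple of 1000, so it works). If the last block were smaller, the block means would need to be weighted by block size. A minimal numpy sketch of the difference, with made-up sizes:

```python
import numpy as np

data = np.arange(25, dtype=float)  # 25 rows, block size 10 -> last block has only 5
block = 10
means, sizes = [], []
for start in range(0, len(data), block):
    chunk = data[start:start + block]
    means.append(chunk.mean())
    sizes.append(len(chunk))

naive = np.mean(means)                       # unweighted mean of block means: biased
weighted = np.average(means, weights=sizes)  # weighted by block length: exact

print(naive, weighted, data.mean())
```

Here `weighted` matches `data.mean()` (12.0), while the unweighted mean of means does not.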
For the third step, the dask way, I proceed as follows:
# Dask way
import time
import h5py
import dask.array as da
from dask import delayed

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
dset = h5py.File(filename, 'r')
dataset_names = list(dset.keys())[0]  # to obtain the dataset name
result = dset.get(dataset_names)
array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
print('Gigabytes of the input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
print('Mean array: {}'.format(x.compute()))
tmp = time.time() - tp
print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
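For what it's worth, my reading of the dask docs is that `da.from_array` already builds a lazy, chunked task graph, so wrapping the reduction in `delayed` should not be needed; calling `.compute()` on the dask array directly is the usual pattern. A minimal sketch on a small in-memory array (the shapes here are made up):

```python
import numpy as np
import dask.array as da

data = np.random.randint(2, size=(1000, 50, 2))

# from_array already yields a lazy, chunked dask array; no delayed() wrapper needed
arr = da.from_array(data, chunks=(100, 50, 2))
mean_axis0 = arr.mean(axis=0).compute()  # one compute() triggers the chunked reduction

print(mean_axis0.shape)  # (50, 2)
```

The result agrees with `data.mean(axis=0)` computed eagerly by numpy.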
I think I am missing some step, since the execution time seems to be longer than with the classical approach. Also, I suspect the chunks option may be the cause, since I used the same chunk size for both the h5 dataset and dask.
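On the chunks question: one rule of thumb from the dask documentation is to aim for chunks in the tens-to-hundreds of megabytes, and the (100, 500, 2) chunks used above are far smaller than that, which produces very many tiny tasks. The arithmetic (assuming 8-byte int64 elements, which is what `np.random.randint` produces by default on most platforms) is easy to check:

```python
# Bytes per chunk, assuming int64 (8 bytes per element)
small = 100 * 500 * 2 * 8     # the chunks used above
bigger = 1000 * 5000 * 2 * 8  # e.g. a hypothetical chunks=(1000, 5000, 2)

print(small / 1e6, 'MB')   # 0.8 MB
print(bigger / 1e6, 'MB')  # 80.0 MB
```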
Any suggestions about this procedure are welcome.