Question

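I have a large HDF5 file (~30 GB) and I need to shuffle the entries (along axis 0) in each dataset. Looking through the h5py docs I couldn't find any randomAccess or shuffle functionality, but I'm hoping I've missed something.

Is anyone familiar enough with HDF5 to think of a fast way to randomly shuffle the data?

Here is pseudocode for what I would implement with my limited knowledge: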

for dataset in datasets:
    unshuffled = range(dataset.dims[0])
    while unshuffled.length != 0:
        if unshuffled.length <= 100:
            dataset[:unshuffled.length/2], dataset[unshuffled.length/2:] = dataset[unshuffled.length/2:], dataset[:unshuffled.length/2]
            break
        else:
            randomIndex1 = rand(unshuffled.length - 100)
            randomIndex2 = rand(unshuffled.length - 100)

            unshuffled.removeRange(randomIndex1..<randomIndex1+100)
            unshuffled.removeRange(randomIndex2..<randomIndex2+100)

            dataset[randomIndex1:randomIndex1 + 100], dataset[randomIndex2:randomIndex2 + 100] = dataset[randomIndex2:randomIndex2 + 100], dataset[randomIndex1:randomIndex1 + 100]

Answer 1

You can use random.shuffle(dataset). For a 30 GB dataset on a laptop with a Core i5 processor, 8 GB of RAM, and a 256 GB SSD, this takes a little over 11 minutes. See the following:

>>> import os
>>> import random
>>> import time
>>> import h5py
>>> import numpy as np
>>>
>>> h5f = h5py.File('example.h5', 'w')
>>> h5f.create_dataset('example', (40000, 256, 256, 3), dtype='float32')
>>> # set all values of each instance equal to its index
... for i, instance in enumerate(h5f['example']):
...     h5f['example'][i, ...] = \
...         np.ones(instance.shape, dtype='float32') * i
...
>>> # get file size in bytes
... file_size = os.path.getsize('example.h5')
>>> print('Size of example.h5: {:.3f} GB'.format(file_size/2.0**30))
Size of example.h5: 29.297 GB
>>> def shuffle_time():
...     t1 = time.time()
...     random.shuffle(h5f['example'])
...     t2 = time.time()
...     print('Time to shuffle: {:.3f} seconds'.format(t2 - t1))
...
>>> print('Value of first 5 instances:\n{}'
...       ''.format(str(h5f['example'][:5, 0, 0, 0])))
Value of first 5 instances:
[ 0.  1.  2.  3.  4.]
>>> shuffle_time()
Time to shuffle: 673.848 seconds
>>> print('Value of first 5 instances after '
...       'shuffling:\n{}'.format(str(h5f['example'][:5, 0, 0, 0])))
Value of first 5 instances after shuffling:
[ 15733.  28530.   4234.  14869.  10267.]
>>> h5f.close()

Shuffling several smaller datasets should perform no worse than this.
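For context on why this works on an on-disk dataset: random.shuffle only calls len() on its argument and swaps items by index, so any object that supports those operations can be shuffled. The sketch below is a hand-written equivalent (a Fisher-Yates shuffle), not code from the answer. The name shuffle_rows_in_place is hypothetical, and the sketch assumes that indexing a row returns a copy, which holds for h5py datasets but not for multi-dimensional NumPy arrays, where rows are views.

```python
import random


def shuffle_rows_in_place(dataset, rng=random):
    """Fisher-Yates shuffle along axis 0 (hypothetical helper).

    Works on any container that supports len() and row get/set by integer
    index, under the assumption that reading a row returns a copy.
    """
    for i in range(len(dataset) - 1, 0, -1):
        j = rng.randint(0, i)  # pick a partner row in [0, i]
        if i == j:
            continue
        # Read both rows first, then write them back swapped.
        row_i, row_j = dataset[i], dataset[j]
        dataset[i], dataset[j] = row_j, row_i


# Small in-memory demonstration with a plain list standing in for a dataset.
rows = list(range(10))
shuffle_rows_in_place(rows, random.Random(0))
```

Each swap costs two row reads and two row writes, so the total I/O grows linearly with the number of rows along axis 0.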

Answer 2

Here is my solution.

Input

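import random

def shuffle(*datas):
    for d in datas:
        # Reseed before every shuffle so each container gets the same
        # permutation (this only lines up when all lengths are equal).
        random.seed(666)
        random.shuffle(d)

a = list(range(6))
b = list(range(6))
c = list(range(6))
shuffle(a, b, c)
a, b, c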

Output

([2, 0, 1, 4, 5, 3], [2, 0, 1, 4, 5, 3], [2, 0, 1, 4, 5, 3])

Input

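import os
import h5py

os.chdir("/usr/local/dataset/flicker25k/")
# Open read-write so the datasets can be shuffled in place.
file = h5py.File("./FLICKR-25K.h5", "r+")
print(os.path.getsize("./FLICKR-25K.h5"))  # file size in bytes
images = file['images']
labels = file['LAll']
tags = file['YAll']
# Same seed for each dataset, so images, tags and labels stay aligned.
shuffle(images, tags, labels)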

Output

executed in 27.9s, finished 22:49:53 2019-05-21
3320572656
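Reseeding before each call keeps the datasets aligned only because they all have the same length along axis 0. A variant that makes this explicit is sketched below: it draws each swap once and applies it to every dataset. This is an illustration rather than the answer's code. The name shuffle_together and its seed parameter are hypothetical, and it assumes that indexing a row returns a copy (as it does for h5py datasets).

```python
import random


def shuffle_together(*datasets, seed=None):
    """Apply one random permutation, in place, to several equal-length containers.

    Hypothetical helper: each container must support len() and row get/set.
    """
    if not datasets:
        return
    n = len(datasets[0])
    if any(len(d) != n for d in datasets):
        raise ValueError("all datasets must have the same length along axis 0")
    rng = random.Random(seed)  # private generator; global random state is untouched
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)
        if i == j:
            continue
        # Apply the same swap to every dataset so rows stay aligned.
        for d in datasets:
            row_i, row_j = d[i], d[j]
            d[i], d[j] = row_j, row_i


# Small in-memory demonstration with lists standing in for datasets.
values = list(range(6))
squares = [v * v for v in values]
shuffle_together(values, squares, seed=666)
```

After the call, squares[k] is still the square of values[k] for every k, which is the alignment property needed for images, tags and labels.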

Shuffling an HDF5 dataset using h5py
