Question

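I'm looking for a fast way to turn my collection of hdf files into a numpy array in which each row is a flattened version of an image. What I mean exactly:

My hdf files store, besides other information, one image per frame. Each file holds 51 frames of 512x424 images. I now have 300+ hdf files, and I want the pixels of each frame stored as one single vector, with all frames of all files stored in one numpy ndarray. The following picture should help to understand:

[image]

What I have so far is a very slow method, and I have no idea how to make it faster. As far as I can tell, the problem is that my final array is touched too often: the first files load into the array very quickly, but the speed drops off fast (observed by printing the number of the current hdf file).

My current code: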

import glob
import os

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")

# predefine first row to use vstack later
numpy_data = np.ndarray((1,217088))

# search for all .hdf5 files
for idx, file in enumerate(glob.glob("*.hdf5")):
    f = h5py.File(file, 'r')
    # load all img data to imgs (=ndarray, but not flattened)
    imgs = f['img']['data'][:]

    # iterate over all frames (51)
    for frame in range(0, imgs.shape[0]):
        print("processing {}/{} (file/frame)".format(idx+1,frame+1))
        data = np.array(imgs[frame].flatten())
        numpy_data = np.vstack((numpy_data, data))

        # delete the dummy first row once the first real row is stored
        if idx == 0 and frame == 0:
            numpy_data = np.delete(numpy_data, 0,0)

    f.close()

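For additional context, I need this for training a decision tree. Since my hdf files are bigger than my RAM, I think converting them into a numpy array would save memory and therefore be better suited.

Thanks for any input.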

Answer 1

I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should be the 2d equivalent.
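To make the equivalence concrete, here is a minimal sketch on a tiny stand-in array (3 frames of 4x5 pixels instead of 51 frames of 512x424; the shapes are assumptions chosen only to keep the example small). It checks that one reshape gives the same rows as flattening frame by frame.

```python
import numpy as np

# Small stand-in for one file's data: 3 frames of 4x5 pixels
# (the real data would be 51 frames of 512x424).
imgs = np.arange(3 * 4 * 5, dtype=np.uint8).reshape(3, 4, 5)

# Reshape the whole stack at once: one row per frame.
flat = imgs.reshape(imgs.shape[0], -1)

# Same result as flattening frame by frame and stacking.
looped = np.vstack([frame.flatten() for frame in imgs])

print(flat.shape)                    # (3, 20)
print(np.array_equal(flat, looped))  # True
```

Passing -1 as the second dimension lets numpy work out the row length, so the same line works for any image size.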

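If you must loop, don't use vstack (or some variant) to build up a bigger array step by step. One, it is slow, and two, it's a pain to clean up the initial 'dummy' entry. Use list appends, and do the stacking once, at the end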

alist = []
for frame in range(imgs.shape[0]):
    data = imgs[frame].flatten()
    alist.append(data)
data_array = np.vstack(alist)

vstack (and family) take a list of arrays as input, so they can work with many arrays at once. List append is much faster when done iteratively.
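The same pattern extends to several files. The sketch below is only an illustration: fake_file is a hypothetical stand-in for reading f['img']['data'][:] from one HDF5 file, and the shapes are deliberately tiny.

```python
import numpy as np

def fake_file(file_index, n_frames=3, height=4, width=5):
    """Hypothetical stand-in for reading f['img']['data'][:] from one file."""
    return np.full((n_frames, height, width), file_index, dtype=np.uint8)

blocks = []
for file_index in range(4):
    imgs = fake_file(file_index)
    # Append one (n_frames, height*width) block per file; a list append is cheap.
    blocks.append(imgs.reshape(imgs.shape[0], -1))

# One copy at the end instead of copying the growing array once per frame.
data_array = np.vstack(blocks)
print(data_array.shape)  # (12, 20)
```

Each call to np.vstack copies everything stacked so far, which is why stacking inside the loop slows down as the array grows, while a single call at the end copies each block only once.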

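I doubt that putting everything into one array will help. I don't know exactly how the size of an hdf5 file relates to the size of the loaded array, but I expect them to be of the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G of pixels?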
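The estimate can be checked with a little arithmetic. This is a rough sketch: it uses exactly 300 files (the question says "300+", so it is a lower bound), and the pixel dtype is not stated in the question, so three common sizes are shown. Note that np.ndarray((1, 217088)) in the question defaults to float64, i.e. 8 bytes per pixel.

```python
n_files = 300                  # "300+" in the question, so a lower bound
frames_per_file = 51
pixels_per_frame = 512 * 424   # 217088

n_values = n_files * frames_per_file * pixels_per_frame
print(n_values)  # 3321446400

# Bytes per value for a few candidate dtypes (the real dtype is an assumption).
for dtype_name, bytes_per_value in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    gib = n_values * bytes_per_value / 2**30
    print(f"{dtype_name}: {gib:.1f} GiB")
# uint8: 3.1 GiB
# float32: 12.4 GiB
# float64: 24.7 GiB
```

So "3G of pixels" holds for one byte per pixel; with the default float64 the same data needs roughly eight times as much.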

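For a single file, h5py has provision for loading chunks of an array that is too large to fit in memory. That suggests the problem often goes the other way: the file holds more than fits in memory.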
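As a hedged sketch of that idea, the example below slices an h5py dataset a few frames at a time instead of reading it whole with [:]. It first writes a small throwaway file that mimics the img/data layout from the question; block_size and the shapes are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file that mimics the layout in the question
# (group 'img', dataset 'data'); the shape here is deliberately tiny.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
data = np.arange(10 * 4 * 5, dtype=np.uint8).reshape(10, 4, 5)
with h5py.File(path, "w") as f:
    f.create_group("img").create_dataset("data", data=data)

block_size = 4  # frames per read; a hypothetical value, tune it to your RAM
row_sums = []
with h5py.File(path, "r") as f:
    dset = f["img"]["data"]  # a handle only, no pixel data is read yet
    n_frames = dset.shape[0]
    for start in range(0, n_frames, block_size):
        stop = min(start + block_size, n_frames)
        block = dset[start:stop]                  # reads only these frames
        flat = block.reshape(block.shape[0], -1)  # one row per frame
        row_sums.append(flat.sum(axis=1))         # any per-block processing

row_sums = np.concatenate(row_sums)
print(row_sums.shape)  # (10,)
```

Only one block is held in memory at a time, so the dataset itself can be larger than RAM as long as the per-block results are small.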

Is it possible to load large data directly into numpy int8 array using h5py?

Answer 2

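Do you really want to load all images into RAM rather than use a single HDF5 file? Accessing an HDF5 file can be quite fast if you avoid the usual mistakes (unintended fancy indexing, an unsuitable chunk size). If you want the numpy approach anyway, this is one possibility: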

import glob
import os

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")
img_per_file=51

# get all HDF5 files
files=glob.glob("*.hdf5")

# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data=np.empty((len(files)*img_per_file,217088),dtype=np.uint8)

# now read all the data, one file per iteration
ii=0
for i in range(0,len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    # write this file's frames into their slice of the preallocated array
    numpy_data[ii:ii+img_per_file,:]=imgs.reshape((img_per_file,217088))
    ii=ii+img_per_file

Writing the data to a single HDF5 file would be quite similar:

# File_Name_HDF5_out and Dataset_Name_out are placeholders for your output file and dataset names
f_out=h5py.File(File_Name_HDF5_out,'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files)*img_per_file,217088), chunks=(1,217088),dtype='uint8')

# now read all the data, one file per iteration
ii=0
for i in range(0,len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii+img_per_file,:]=imgs.reshape((img_per_file,217088))
    ii=ii+img_per_file

f_out.close()

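If you only want to access whole images afterwards, this chunk size should be fine. If not, you have to change it to fit your needs.

What you should do when accessing an HDF5 file:

Use a chunk size that fits your needs.
Set a proper chunk-cache-size. This can be done with the h5py low-level API or with h5py_cache. https://pypi.python.org/pypi/h5py-cache/1.0

Avoid any type of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions.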

# Chunk size is [50,50] and we iterate over the first dimension
numpyArray=h5_dset[i,:] #slow
numpyArray=np.squeeze(h5_dset[i:i+1,:]) #does the same but is much faster

Edit: This shows how to read your data into a memmapped numpy array. I think your method expects data in np.float32 format. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files)*img_per_file,217088))

Everything else can stay the same. If it works, I would also recommend using an SSD instead of a hard disk.
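A small self-contained sketch of the memmap round trip follows. The shapes, the temporary path, and the np.full stand-in for one file's frames are all assumptions for illustration; in the real case the blocks would come from the HDF5 files as in the loop above.

```python
import os
import tempfile

import numpy as np

n_rows, n_cols = 6, 20  # tiny stand-in for (len(files) * img_per_file, 217088)
path = os.path.join(tempfile.mkdtemp(), "data.dat")

# mode='w+' creates (or overwrites) the backing file on disk.
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))

# Fill it block by block, just like the in-memory array.
for start in range(0, n_rows, 3):
    block = np.full((3, n_cols), start, dtype=np.uint8)  # stand-in for one file's frames
    mm[start:start + 3, :] = block  # values are cast to float32 on assignment
mm.flush()  # push pending changes to disk
del mm

# Reopen read-only later; shape and dtype must be supplied again,
# because the raw file stores no header.
ro = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
print(ro[3, 0])  # 3.0
```

Because np.memmap writes raw bytes without any header, the file is not a real .npy file whatever its extension, and np.load will not read it; reopen it with np.memmap and the same dtype and shape.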

hdf to ndarray numpy - fast way
