Question

Is it possible to slice an h5py dataset into two subsets without actually loading them into memory? E.g.:

import h5py

dset = h5py.File("/2tbhd/tst.h5py", "r")

# N is assumed to be the number of rows in dset['X']
X_train = dset['X'][:N // 2]
X_test  = dset['X'][N // 2:-1]

Answer 1

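No.

You would need to implement your own class to act as a view on the dataset. An old thread on the h5py mailing list indicates that it would be theoretically possible to implement such a DatasetView class using HDF5 dataspaces, but it is probably not worth it for many use cases. Element-wise access would be very slow compared to a normal numpy array (assuming you can fit your data into memory).

Edit: If you want to avoid messing with HDF5 dataspaces (whatever that means), you might settle for a simpler approach. Try this gist I wrote. Use it like this: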

import h5py
import numpy

from simpleview import SimpleView  # SimpleView comes from the gist mentioned above

dset = h5py.File("/2tbhd/tst.h5py", "r")
X_view = SimpleView(dset['X'])

# Stores slices, but doesn't load into memory
X_train = X_view[:N // 2]
X_test  = X_view[N // 2:-1]

# These statements will load the data into memory.
print(numpy.sum(X_train))
print(numpy.array(X_test)[0])

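Note that the slicing support in this simple example is somewhat limited. If you want full slicing and element-wise access, you will have to copy it into a real array: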

X_train_copy = numpy.array(X_train)
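
For readers who cannot get hold of the gist, here is a rough sketch of the same idea. It is not the gist's code: `LazyRows` is a hypothetical name, it only handles slices with a positive step along the first axis, and it assumes the wrapped object supports `len()` and slice indexing (as an h5py dataset does). Nothing is read until `load()` is called or numpy asks for an array.

```python
class LazyRows:
    """Hypothetical lazy view over the first axis of an array-like dataset."""

    def __init__(self, dataset, rows=None):
        self._dataset = dataset
        # A range records which rows are selected without touching the data.
        self._rows = range(len(dataset)) if rows is None else rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices along the first axis are supported")
        if key.step is not None and key.step <= 0:
            raise ValueError("slice step must be positive")
        # Slicing the range composes the selections; no data is read here.
        return LazyRows(self._dataset, self._rows[key])

    def load(self):
        """Read the selected rows from the underlying dataset."""
        r = self._rows
        return self._dataset[r.start:r.stop:r.step]

    def __array__(self, dtype=None, copy=None):
        # Lets numpy.array(view) or numpy.sum(view) trigger the read.
        # Assumes slicing the dataset returns a numpy array, as h5py does.
        data = self.load()
        return data if dtype is None else data.astype(dtype)
```

With the file from the question, `LazyRows(dset['X'])[:N // 2]` would only remember the row range; the read from disk happens when you call `.load()` on it or pass it to `numpy.array`.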
