Question

I have an HDF5 dataset that I read as a numpy array:

import h5py

my_file = h5py.File(h5filename, 'r')
file_image = my_file['/image']

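and a list of indices called in.

I want to split the image dataset into two separate np.arrays: one containing the images that correspond to the indices in in, and another containing the images whose indices are not in in. The order of the images is very important: I want to split the dataset so that the original order of the images is preserved within each subset. How can I do this?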

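I tried indexing the dataset directly with the index list (my code is below). However, h5py gave me an error saying that the indices must be in increasing order. I can't change the order of the indices, because I need the images to stay in their original order.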

My code:

labeled_image_dataset = list(file_image[in])

Answer 1

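I don't follow why or how you are choosing your indices, but as the error indicates, the indices must be sorted when you index an array that lives in an h5 file. Remember that a file is serial storage, so it is easier and faster to read straight through than to jump back and forth. Wherever the constraint actually sits, in h5py or in the h5 backend, idx has to be sorted.

But if you load the whole array into memory (or some contiguous chunk of it), which will probably require a copy, then you can use an unsorted, and even repetitious, idx list.

In other words, h5py arrays can be indexed like numpy arrays, but with some restrictions.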

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
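
A minimal sketch of the in-memory route described above. To keep it runnable without an HDF5 file, a small NumPy array stands in for file_image (that stand-in is my assumption, not part of the question); with a real h5py dataset, file_image[...] would read everything into a NumPy array, assuming it fits in memory. The names in_memory and labeled are made up for illustration.

```python
import numpy as np

# Stand-in for `file_image`: 6 tiny 2x2 "images", image k filled with value k.
# Assumption: with h5py, `file_image[...]` yields the same kind of ndarray.
file_image = np.arange(6).repeat(4).reshape(6, 2, 2)

# Unsorted index list with a repeat (the question's `in`, renamed).
idx = [4, 0, 3, 0]

# Read everything into memory once; after that, plain NumPy indexing
# accepts unsorted and repeated indices.
in_memory = np.asarray(file_image[...])
labeled = in_memory[idx]

# First pixel of each selected image, in the order given by idx.
print(labeled[:, 0, 0])
```

The trade-off is memory: the whole dataset is copied before anything is selected.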

Answer 2

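Using in as a variable name is not a good idea, because in is a reserved Python keyword (see, for example, the list comprehensions below), so it cannot actually be assigned to. For clarity I've renamed it to idx.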
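
A quick way to check the keyword point yourself, using only the standard library. The index values in the compiled string are arbitrary placeholders.

```python
import keyword

print(keyword.iskeyword("in"))   # `in` is a reserved word
print(keyword.iskeyword("idx"))  # `idx` is an ordinary name

# Assigning to `in` is rejected at compile time, before anything runs.
try:
    compile("in = [5, 1, 9, 0, 3]", "<example>", "exec")
    assignment_compiles = True
except SyntaxError:
    assignment_compiles = False

print(assignment_compiles)
```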

A simple solution would be to just iterate over your set of indices in a standard Python for loop or list comprehension:

labelled_dataset = [file_image[ii] for ii in idx]
unlabelled_dataset = [file_image[ii] for ii in range(len(file_image)) 
                      if ii not in idx]
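
If the list comprehension gets slow (ii not in idx scans the whole list on every iteration), a boolean mask is one possible alternative for building the two index sets. This is only a sketch: n_images stands in for len(file_image), and the names selected, labelled_idx and unlabelled_idx are illustrative.

```python
import numpy as np

n_images = 10                    # stand-in for len(file_image)
idx = np.array([5, 1, 9, 0, 3])  # unsorted index list

# Mark which positions were selected.
selected = np.zeros(n_images, dtype=bool)
selected[idx] = True

# Both index arrays come out in increasing order, i.e. in file order.
labelled_idx = np.flatnonzero(selected)     # sorted idx, repeats dropped
unlabelled_idx = np.flatnonzero(~selected)  # every index not in idx

print(labelled_idx)
print(unlabelled_idx)
```

Since both arrays are increasing, they should satisfy the ordering requirement from the error message when used as file_image[labelled_idx] and file_image[unlabelled_idx]; I have not run that part against a real file here.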

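Using vectorized indexing might be faster, for example if you are loading a large number of small image patches. In that case you can use np.argsort to find the set of indices that sorts idx in ascending order, then index into file_image with the sorted indices. Afterwards you can "undo" the effect of sorting idx by indexing the result with the inverse of that sorting permutation, which np.argsort(order) gives you.

Here's a simplified example to illustrate:

import numpy as np

target = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
idx = np.array([5, 1, 9, 0, 3])

# find a set of indices that will sort `idx` in ascending order
order = np.argsort(idx)
ascending_idx = idx[order]

# use the sorted indices to index into the target array
ascending_result = target[ascending_idx]

# "undo" the effect of sorting `idx` by indexing the result with the inverse
# permutation of `order` (indexing with `order` itself is only correct when
# `order` happens to be its own inverse, as it is for this particular `idx`)
inverse_order = np.argsort(order)
result = ascending_result[inverse_order]

print(np.all(result == target[idx]))
# True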
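
The sort-then-restore idea can be wrapped in a small helper. The sketch below is one possible variant, not the method from the answer: it uses np.unique with return_inverse=True, which sorts the indices and also collapses repeats, on the assumption that the index list is one-dimensional. take_in_order is a hypothetical name, and a NumPy array stands in for the h5py dataset so the snippet runs without a file.

```python
import numpy as np

def take_in_order(dataset, idx):
    """Return the rows of `dataset` in the order given by `idx`, while
    indexing `dataset` itself only with increasing, unique indices."""
    idx = np.asarray(idx)
    # sorted_unique is increasing without repeats; inverse maps each entry
    # of idx to its position inside sorted_unique.
    sorted_unique, inverse = np.unique(idx, return_inverse=True)
    rows = dataset[sorted_unique]  # the only read from `dataset`
    return rows[inverse]           # restore the requested order

# Stand-in for the dataset (assumption: an h5py dataset would be used here).
target = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

result = take_in_order(target, [2, 0, 1, 2])
print(result)
```

Only the selected rows are read, so this avoids loading the whole dataset, at the cost of one extra in-memory reordering step.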

Exclude a set of indices from an array while keeping its original order
