Question

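From the h5py docs, I see that I can cast an HDF dataset to another type using the astype method of the dataset. This returns a context manager that performs the conversion on the fly.

However, I would like to read in a dataset stored as uint16 and then cast it to float32. After that, I would like to extract various slices from this dataset in a different function, as the cast type float32. The docs explain the usage as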

with dataset.astype('float32'):
   castdata = dataset[:]

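This causes the entire dataset to be read in and converted to float32, which is not what I want. I would like a reference to the dataset that behaves as if it had been cast to float32, equivalent to numpy.astype. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?

An example: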

import h5py as HDF
import numpy as np
intdata = (100*np.random.random(10)).astype('uint16')

# create the HDF dataset
def get_dataset_as_float():
    hf = HDF.File('data.h5', 'w')
    d = hf.create_dataset('data', data=intdata)
    print(d.dtype)
    # uint16

    with d.astype('float32'):
    # This won't work since the context expires. Returns a uint16 dataset reference
       return d

    # this works but causes the entire dataset to be read & converted
    # with d.astype('float32'):
    #   return d[:]

Furthermore, it seems that the astype context only applies when the data elements are accessed. This means that

def use_data():
   d = get_dataset_as_float()
   # this is a uint16 dataset

   # try to use it as a float32
   with d.astype('float32'):
       print(np.max(d))   # --> output is uint16
       print(np.max(d[:]))   # --> output is float32, but entire data is loaded

So is there no straightforward way to use astype for this?

Answer 1

d.astype() returns an AstypeContext object. If you look at the source for AstypeContext, you will get a better idea of what is going on:

class AstypeContext(object):

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)

    def __enter__(self):
        self._dset._local.astype = self._dtype

    def __exit__(self, *args):
        self._dset._local.astype = None

When you enter the AstypeContext, the ._local.astype attribute of the dataset is updated to the new desired type, and when you exit the context it is reset to None.
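To make that mechanism concrete, here is a minimal toy sketch in pure Python. ToyDataset and ToyAstypeContext are hypothetical stand-ins written for this illustration, not h5py classes; they only mimic the pattern shown above, where the cast type is a piece of state on the dataset that is consulted at read time.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; stores plain ints in a list."""

    def __init__(self, values):
        self._values = list(values)
        self._astype = None  # plays the role of ._local.astype

    def astype(self, kind):
        return ToyAstypeContext(self, kind)

    def __getitem__(self, key):
        out = self._values[key]
        if self._astype is None:
            return out
        # Conversion happens only at read time, and only while a cast is set.
        if isinstance(out, list):
            return [self._astype(v) for v in out]
        return self._astype(out)


class ToyAstypeContext:
    def __init__(self, dset, kind):
        self._dset = dset
        self._kind = kind

    def __enter__(self):
        self._dset._astype = self._kind

    def __exit__(self, *args):
        self._dset._astype = None


def get_toy_as_float(d):
    with d.astype(float):
        return d  # __exit__ still runs on return, so the cast is cleared


d = ToyDataset([3, 1, 2])
with d.astype(float):
    inside = d[0:2]          # read while the cast is active
outside = d[0:2]             # read after the context has exited
after_return = get_toy_as_float(d)[0:2]
```

Under these assumptions, only the read performed inside the with block yields floats. Returning the dataset from inside the block does not help, because the state is reset as soon as the block is left, which matches the behaviour described in the question.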

Therefore, you can get more or less the behaviour you are looking for like this:

def get_dataset_as_type(d, dtype='float32'):

    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)

    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)

    return d_new

When you now read from d_new, you will get float32 numpy arrays back rather than uint16:

d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')

print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81.,  65.,  33.,  22.,  67.,  57.,  94.,  63.,  89.,  68.], dtype=float32)

print(d.dtype, d_new.dtype)
# uint16, uint16

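Note that this does not update the .dtype attribute of d_new (which seems to be immutable). If you also want to change the dtype attribute, you would probably need to subclass h5py.Dataset in order to do so.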
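Since the approach above relies on the private ._local.astype attribute, which may not be present in every h5py version, an alternative is a small wrapper that casts each slice as it is read. The following is a rough sketch, not part of h5py: AsTypeView is a hypothetical helper name, and a NumPy array stands in for the open dataset so the snippet runs on its own. Note that here the conversion is done by NumPy after the slice has been read, rather than by HDF5 during the read.

```python
import numpy as np


class AsTypeView:
    """Hypothetical helper: casts each slice to `dtype` when it is read."""

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = np.dtype(dtype)

    @property
    def dtype(self):
        # Reports the cast type, unlike the d_new.dtype case above.
        return self._dtype

    @property
    def shape(self):
        return self._dset.shape

    def __len__(self):
        return len(self._dset)

    def __getitem__(self, key):
        # Only the requested slice is read, then converted by NumPy.
        return np.asarray(self._dset[key]).astype(self._dtype)


# Stand-in for an open h5py dataset stored as uint16 (assumption for the demo)
stored = np.arange(10, dtype="uint16")
view = AsTypeView(stored, "float32")
part = view[2:5]
```

The view object can be passed to other functions in place of the dataset, and each of them only pays for the slices it actually indexes. The underlying file still has to stay open for as long as the view is in use.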

Answer 2

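The docs for astype seem to imply that reading it all into a new location is its purpose. So your return d[:] is the most reasonable option if you are going to reuse the float-cast data with many functions on separate occasions.

If you know what you need the cast for and only need it once, you could switch things around and do something like: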

def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]
            return tuple(f(d2) for f in funcs)

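In any case, you want to make sure that hf is closed before leaving the function, or you will run into problems later on.

In general, I would suggest separating the casting from the loading/creation of the dataset entirely, and passing the dataset in as one of the function's parameters.

The get_dataset_as_float function above can be called as follows: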

In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
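As for the suggestion to pass the dataset in as a parameter, one possible shape for it is sketched below. reduce_slice is a hypothetical name chosen for this example, and a NumPy array again stands in for the open dataset; the idea is simply that the caller owns the file, while the function receives the dataset and converts only the slice it needs.

```python
import numpy as np


def reduce_slice(dset, index, *funcs, dtype="float32"):
    """Read one slice from `dset`, cast it, and apply each function to it."""
    chunk = np.asarray(dset[index]).astype(dtype)
    return tuple(f(chunk) for f in funcs)


# Stand-in for an open h5py dataset stored as uint16 (assumption for the demo)
stored = np.array([9, 87, 42, 13, 60], dtype="uint16")
lo, hi = reduce_slice(stored, slice(1, 4), np.min, np.max)
```

With a real file, the call would sit inside the caller's with HDF.File(...) block, so the file is closed by the caller once all the slices have been processed.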

Creating a reference to an HDF dataset in H5py using astype
