Question

我正在尝试使用包含复合对象的一维数组的 h5py (Python 3) 设置并写入 HDF5 数据集。每个复合对象由三个可变长度的字符串属性组成。

     with h5py.File("myfile.hdf5", "a") as file:
         dt = np.dtype([
             ("label", h5py.string_dtype(encoding='utf-8')),
             ("name", h5py.string_dtype(encoding='utf-8')),
             ("id", h5py.string_dtype(encoding='utf-8'))])
         dset = file.require_dataset("initial_data", (50000,), dtype=dt)
         dset[0, "label"] = "foo"

当我运行上面的例子时，最后一行代码导致 h5py（或更准确地说是 numpy）抛出一个错误说：

<块引用>

“无法更改对象数组的数据类型。”

我是否正确理解 "foo" 的类型不是 h5py.string_dtype(encoding='utf-8')？

怎么会？我该如何解决这个问题？

更新1：进入堆栈跟踪，我可以看到错误是从一个名为 _view_is_safe(oldtype, newtype) 的内部 numpy 函数抛出的。就我而言，oldtype 是 dtype('O') 但 newtype 是 dtype([('label', 'O')])，这会导致抛出错误。

更新2：我的问题已在下面成功回答，但为了完整起见，我将链接到可能相关的 GH 问题：https://github.com/h5py/h5py/issues/1921

Answer 1

您将 dtype 设置为可变长度字符串的元组，因此您需要一次性设置所有元组。仅设置标签元素，其他两个元组值未设置，因此它们不是字符串类型。

示例：

import h5py
import numpy as np

with h5py.File("myfile.hdf5", "a") as file:
    dt = np.dtype([
        ("label", h5py.string_dtype(encoding='utf-8')),
        ("name", h5py.string_dtype(encoding='utf-8')),
        ("id", h5py.string_dtype(encoding='utf-8'))])
    dset = file.require_dataset("initial_data", (50000,), dtype=dt)

#Add a row of data with a tuple:
    dset[0] = "foo", "bar", "baz"
 
#Add another row of data with a np recarray (1 row):
    npdt = np.dtype([
        ("label", 'S4'),
        ("name", 'S4'),
        ("id", 'S4') ])
    dset[1] = np.array( ("foo1", "bar1", "baz1"), dtype=npdt )
       
#Add 3 rows of data with a np recarray (3 rows built from a list of arrays):
    s1 = np.array( ("A", "B", "C"), dtype='S4' )
    s2 = np.array( ("a", "b", "c"), dtype='S4' )
    s3 = np.array( ("X", "Y", "Z"), dtype='S4' )
    recarr = np.rec.fromarrays([s1, s2, s3], dtype=npdt)
    dset[2:5] = recarr

结果 #1：

使用所有 3 种方法的结果：

为什么在向数据集添加 3 个可变长度字符串时 h5py 会抛出错误？

1 个答案: