Question

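I know that in C we can use a struct type to easily build a compound dataset and assign data chunk by chunk. I am currently implementing a similar structure in Python with h5py.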

import h5py
import numpy as np 

# Create an HDF5 file; pass mode "a" explicitly (h5py 3.x defaults to read-only "r")
f = h5py.File("test.h5", "a")


# We define a compound datatype using np.dtype
dt_type = np.dtype({"names":["image","feature"],
                   "formats":[('<f4',(4,4)),('<f4',(10,))]})

# we define our dataset with 5 instances
a = f.create_dataset("test", shape=(5,), dtype=dt_type)

To write data, we can do this...

# "feature" array is 1D
a['feature']

The output is

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

# Write 1s to data field "feature"
a["feature"] = np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)

The problem arises when I write the 2D array "image" to the file.

a["image"] = np.ones((5,4,4))

ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.

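I have read the documentation and done some research. Unfortunately, I did not find a good solution. I know we can use groups/datasets to mimic this kind of compound data, but I would really like to keep this structure. Is there a good way to do it?

Any help would be appreciated. Thanks.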

Answer 1

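You can use PyTables (the tables package) to populate an HDF5 file with the arrays you want. Think of each row as an independent entry (defined by the dtype). The "image" array is therefore stored as five (4x4) ndarrays rather than a single (5x4x4) ndarray. The same goes for the "feature" array.

This example appends the "feature" and "image" arrays one row at a time. Alternatively, you can build a numpy record array that holds multiple rows of data for both arrays and add it with the Table.append() function.

The code below creates the file, then reopens it read-only to check the data.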

import tables as tb
import numpy as np 

# Open the HDF5 file for writing
with tb.open_file('test1_tb.h5', 'w') as h5f:

    # Define a compound datatype using np.dtype
    dt_type = np.dtype({"names": ["feature", "image"],
                        "formats": [('<f4', (10,)), ('<f4', (4, 4))]})

    # Create an empty table (dataset)
    a = h5f.create_table('/', "test1", description=dt_type)

    # Get the table's row iterator
    a_row = a.row
    # Fill in the array data and append each row to the table
    for i in range(5):
        a_row['feature'] = i * np.ones(10)
        a_row['image'] = np.random.random(4 * 4).reshape(4, 4)
        a_row.append()

    a.flush()

# Reopen the file read-only and print its contents
with tb.open_file('test1_tb.h5', 'r') as h5fr:
    a = h5fr.get_node('/', 'test1')
    print(a.coldtypes)
    print('# of rows:', a.nrows)

    for row in a:
        print(row['feature'])
        print(row['image'])
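
The alternative mentioned above, building a record array and passing it to Table.append(), might look roughly like the sketch below. The file name test1_bulk.h5, the temporary directory, and the fill values are assumptions chosen for illustration; only create_table(), append(), flush(), get_node() and read() are PyTables calls.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Same compound layout as in the row-by-row example
dt_type = np.dtype([("feature", "<f4", (10,)), ("image", "<f4", (4, 4))])

# Build all five rows in memory as one structured (record) array
records = np.zeros(5, dtype=dt_type)
records["feature"] = np.arange(5, dtype=np.float32)[:, None] * np.ones(10, dtype=np.float32)
records["image"] = np.ones((5, 4, 4), dtype=np.float32)

# Hypothetical output location, used only for this sketch
bulk_path = os.path.join(tempfile.mkdtemp(), "test1_bulk.h5")

with tb.open_file(bulk_path, "w") as h5f:
    table = h5f.create_table("/", "test1", description=dt_type)
    table.append(records)  # append every row in a single call
    table.flush()

# Read the whole table back as a structured numpy array
with tb.open_file(bulk_path, "r") as h5fr:
    stored = h5fr.get_node("/", "test1").read()

print(stored["feature"].shape)
print(stored["image"].shape)
```

Appending the whole array at once avoids the Python-level loop over rows, which may matter when there are many more than five entries.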

Answer 2

This blog post helped me solve the problem: https://www.christopherlovell.co.uk/blog/2016/04/27/h5py-intro.html

Key code for writing the compound dataset:

from os import path

import numpy as np
import h5py

# Load your dataset into numpy
# (root_dir is not defined in this snippet; it must already point to your data directory)
audio = np.load(path.join(root_dir, 'X_dev.npy')).astype(np.float32)
text = np.load(path.join(root_dir, 'T_dev.npy')).astype(np.float32)
gesture = np.load(path.join(root_dir, 'Y_dev.npy')).astype(np.float32)

# Open an HDF5 file
hf = h5py.File(path.join(root_dir, "dev.hdf5"), 'a')

# create group
g1 = hf.create_group('dev') 

# put dataset in subgroups
g1.create_dataset('audio', data=audio)
g1.create_dataset('text', data=text)
g1.create_dataset('gesture', data=gesture)

# close the hdf5 file
hf.close()
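
The code above stores each array as its own dataset inside a group. If the compound layout from the question should be kept, one possible approach in h5py is to fill a numpy structured array in memory and write whole records in a single assignment, rather than assigning one field at a time. The following is a minimal sketch under that assumption; the file name compound_demo.h5 and the temporary directory are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Compound dtype from the question: a 4x4 image and a 10-element feature per record
dt_type = np.dtype([("image", "<f4", (4, 4)), ("feature", "<f4", (10,))])

# Fill both fields in memory, where per-field assignment is plain numpy
records = np.zeros(5, dtype=dt_type)
records["image"] = np.ones((5, 4, 4), dtype=np.float32)
records["feature"] = np.ones((5, 10), dtype=np.float32)

# Hypothetical output location, used only for this sketch
compound_path = os.path.join(tempfile.mkdtemp(), "compound_demo.h5")

with h5py.File(compound_path, "w") as f:
    ds = f.create_dataset("test", shape=(5,), dtype=dt_type)
    ds[...] = records  # write all fields of all records at once

# Read the dataset back as a structured numpy array
with h5py.File(compound_path, "r") as f:
    read_back = f["test"][...]

print(read_back["image"].shape)
print(read_back["feature"].shape)
```

Whether this is preferable to the group/dataset layout depends on how the data is read later; the sketch only shows that the compound structure can be written without per-field assignment on the dataset.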

How do I write data to a compound dataset with h5py?
