Question

I am currently exploring the HDF5 library with Python, but I have run into some problems. I have a dataset with this layout:

GROUP "GROUP1" {
                  DATASET "DATASET1" {
                     DATATYPE  H5T_COMPOUND {
                        H5T_STD_I64LE "DATATYPE1";
                        H5T_STD_I64LE "DATATYPE2";
                        H5T_STD_I64LE "DATATYPE3";
                     }
                     DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
                     DATA {
                     (0): {
                           1,
                           2,
                           3

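I am trying to iterate over the dataset to get the value associated with each datatype and copy it to a text file. (For example, "1" is the value associated with "DATATYPE1".) The following script works: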

new_file = open('newfile.txt', 'a')
for i in range(len(dataset[...])):
    # label each row, then write one value per line
    new_file.write('Ligne ' + str(i) + " " + ":" + " ")
    for j in range(len(dataset[i, ...])):
        new_file.write(str(dataset[i][j]) + "\n")

But it is not very clean, so I tried to get the values by referring to the datatypes by name. The closest script I found is the following:

for attribute in group.attrs:
    print(group.attrs[attribute])

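Unfortunately, despite my attempts, the same approach does not work for datatypes:

# check the datatypes in the dataset
for data.dtype in dataset.dtype:
    # then print the datatypes
    print(dataset.dtype[data.dtype])

The error message is "'numpy.dtype' object is not iterable". Do you have any idea how to handle this? I hope my question is clear.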

Answer 1

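Without your data, it is hard to provide a specific solution. Here is a very simple example that uses pytables (& numpy) to mimic your data schema. First, it creates the HDF5 file with a table named DATASET1 under the group GROUP1. Each row of DATASET1 has 3 int values named DATATYPE1, DATATYPE2, and DATATYPE3. The ds1.append() function adds rows of data to the table (1 row at a time).
After the data is created, walk_nodes() is used to traverse the HDF5 file structure and print the node names and, for tables, the dtype.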

import tables as tb
import numpy as np

with tb.open_file("SO_56545586.h5", mode = "w") as h5f:

    ds1 = h5f.create_table('/GROUP1', 'DATASET1', 
                           description=np.dtype([('DATATYPE1', int),('DATATYPE2', int),('DATATYPE3', int)]), 
                           createparents=True)
    for row in range(5) :
        row_vals = [ (row, row+1, row*2), ]
        ds1.append(row_vals)

## This section walks the file structure (groups and datasets), printing node names and dtype for tables:

    for this_node in h5f.walk_nodes('/'):
        print (this_node)
        if isinstance(this_node, tb.Table) :
            print (this_node.dtype)

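Note: when opening an existing file, do not use mode = "w". It creates a new file (overwriting the existing one). Use mode = "a" or mode = "r+" if you need to append data, or mode = "r" if you only need to read data.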
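
To illustrate the read-only mode, here is a minimal sketch of reading the values back by column name. It assumes PyTables is installed; the file name example.h5 and the two sample rows are made up for illustration. It relies on Table.colnames (the list of column names) and Table.col() (one column as a NumPy array).

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical file in a temporary directory, so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

dt = np.dtype([('DATATYPE1', 'i8'), ('DATATYPE2', 'i8'), ('DATATYPE3', 'i8')])

# mode="w" is only safe here because the file does not exist yet.
with tb.open_file(path, mode="w") as h5f:
    table = h5f.create_table('/GROUP1', 'DATASET1', description=dt,
                             createparents=True)
    table.append([(1, 2, 3), (4, 5, 6)])  # made-up sample rows

# mode="r" opens the existing file for reading only.
with tb.open_file(path, mode="r") as h5f:
    table = h5f.get_node('/GROUP1/DATASET1')
    # Read each column by name into its own NumPy array.
    by_name = {name: table.col(name) for name in table.colnames}
    for name in table.colnames:
        print(name, by_name[name])
```

The arrays in by_name are in-memory copies, so they stay usable after the file is closed.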

Answer 2

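To complete the solution added by kcw78, I found that this script also works. Since I could not iterate over the dataset directly, it copies the dataset into a new array: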

import numpy as np

# `file` is an h5py.File object that has already been opened
dataset = file['path_to_dataset']

data = np.array(dataset)  # copy the dataset values into a new NumPy array
print(data)

ls_column = list(data.dtype.names)  # list of the field names (datatypes) of the compound type
print(ls_column)  # show the field names associated with the values printed above

# Build one array per field name instead of one record per row
for col in ls_column:
    k = data[col]  # e.g. k = data['DATATYPE1'], k = data['DATATYPE2']
    print(k)
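
The same idea can be tried without an HDF5 file. The following is a small sketch that uses a plain NumPy structured array as a stand-in for np.array(dataset); the sample values are made up. The dtype object itself cannot be looped over (that is the error from the question), but dtype.names is an ordinary tuple of field names.

```python
import numpy as np

# Stand-in for np.array(dataset): same three fields, made-up values.
data = np.array([(1, 2, 3), (4, 5, 6)],
                dtype=[('DATATYPE1', 'i8'), ('DATATYPE2', 'i8'), ('DATATYPE3', 'i8')])

# Column-wise access: one array per field name.
columns = {name: data[name] for name in data.dtype.names}

# Row-wise access: each row is a record that can be indexed by field name.
for i, row in enumerate(data):
    for name in data.dtype.names:
        print(i, name, row[name])
```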

Answer 3

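Arnaud, OK, I see you are using h5py. I don't understand what you mean by "I could not iterate over the dataset". You can iterate over rows or over columns/fields. Here is an example that demonstrates this with h5py.

It shows 4 ways to extract data from the dataset (the last one iterates):

Read the entire HDF5 dataset into a np array
Then read 1 column from that array into another array
Read 1 column from the HDF5 dataset as an array
Loop over the HDF5 dataset columns and read 1 at a time as an array

Note that the result returned by .dtype.names is iterable. You don't need to create a list (unless you need it for other purposes). Also, HDF5 supports mixed types in a dataset, so you can get a dtype with int, float, and string values (it will be a record array).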

import h5py
import numpy as np

with h5py.File("SO_56545586.h5", "w") as h5f:

    # create empty dataset 'DATASET1' in group '/GROUP1'
    # dtype argument defines names and types
    ds1 = h5f.create_dataset('/GROUP1/DATASET1', (10,), 
              dtype=np.dtype([('DATATYPE1', int),('DATATYPE2', int),('DATATYPE3', int)]) )

    for row in range(5) :  # load some arbitrary data into the dataset
        row_vals = [ (row, row+1, row*2), ]
        ds1[row] = row_vals

    # to read the entire dataset as an array
    ds1_arr = h5f['/GROUP1/DATASET1'][:] 
    print (ds1_arr.dtype) 

    # to read 1 column from ds1_arr as an array
    ds1_col1 = ds1_arr[:]['DATATYPE1'] 
    print ('for DATATYPE1 from ds1_arr, dtype=',ds1_col1.dtype)

    # to read 1 HDF5 dataset column as an array
    ds1_col1 = h5f['/GROUP1/DATASET1'][:,'DATATYPE1'] 
    print ('for DATATYPE1 from HDF5, dtype=',ds1_col1.dtype)

    # to loop thru HDF5 dataset columns and read 1 at a time as an array
    for col in h5f['/GROUP1/DATASET1'].dtype.names :
        print ('for ', col, ', dtype=',h5f['/GROUP1/DATASET1'][col].dtype) 
        col_arr = h5f['/GROUP1/DATASET1'][col][:]
        print (col_arr.shape)
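
Coming back to the original goal (copying each value to a text file together with its datatype name), a minimal sketch could look like the one below. It assumes h5py is installed; the file names and the two sample rows are made up, and the "Ligne" label is kept from the question's script.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file locations in a temporary directory.
tmp_dir = tempfile.mkdtemp()
h5_path = os.path.join(tmp_dir, "example.h5")
txt_path = os.path.join(tmp_dir, "newfile.txt")

dt = np.dtype([('DATATYPE1', 'i8'), ('DATATYPE2', 'i8'), ('DATATYPE3', 'i8')])
with h5py.File(h5_path, "w") as h5f:
    # made-up sample rows; intermediate group GROUP1 is created automatically
    h5f.create_dataset('/GROUP1/DATASET1',
                       data=np.array([(1, 2, 3), (4, 5, 6)], dtype=dt))

with h5py.File(h5_path, "r") as h5f, open(txt_path, "w") as out:
    rows = h5f['/GROUP1/DATASET1'][:]  # read once into a structured array
    for i, row in enumerate(rows):
        for name in rows.dtype.names:  # field names, in declaration order
            out.write(f"Ligne {i} : {name} = {row[name]}\n")

with open(txt_path) as f:
    lines = f.read().splitlines()
print(lines)
```

Reading the dataset once with [:] and then looping over the in-memory array avoids one HDF5 read per value, which matters for larger datasets.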

How to iterate over datatypes to get the associated values?
