Check whether a dataset exists using a regular expression, without first reading the paths of all datasets

Time: 2018-06-06 12:33:42

Tags: python hdf5 h5py

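How can I use a regular expression to check whether a dataset exists, without first reading the paths of all datasets?

For example, I want to check whether the dataset 'completed' exists in a file that may (or may not) contain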

/123/completed

(Assume I don't know the full path; I only want to check the dataset name. So this answer doesn't work in my case.)

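2 answers:

Answer 0 (score: 1)

Custom recursion

No regular expression is needed. You can build a set of dataset names by recursively traversing the groups in the HDF5 file: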

import h5py

def traverse_datasets(hdf_file):

    """Traverse all datasets across all groups in HDF5 file."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]

all_datasets = set(traverse_datasets('file.h5'))

Then just test for membership: 'completed' in all_datasets
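
If a regular expression is still wanted, or the scan should stop at the first hit, the names can be tested lazily instead of being collected into a set first. The following is a minimal sketch and not part of the original answer: any_name_matches is a hypothetical helper that accepts any iterable of names, such as the generator returned by traverse_datasets above, and the sample names are made up.

```python
import re

def any_name_matches(names, pattern):
    """Return True as soon as one name fully matches the regular expression."""
    regex = re.compile(pattern)
    # any() stops consuming the iterable at the first match
    return any(regex.fullmatch(name) for name in names)

# Illustrative names only; in practice pass traverse_datasets('file.h5')
sample_names = iter(['raw', 'completed', 'never_reached'])
found = any_name_matches(sample_names, r'complete(d)?')
remaining = list(sample_names)  # whatever was not consumed by the search
```

Because any() short-circuits, the iterable is only consumed up to the first matching name.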

Group.visit

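Alternatively, you can use Group.visit. Note that the search function has to return None for the traversal to cover all groups; any other return value stops visit early.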

res = []

def searcher(name, k='completed'):
    """Collect every object (group or dataset) with k anywhere in its name."""
    if k in name:
        res.append(name)
    # Returning None tells visit() to keep going; any other value stops it
    return None

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # names of all objects (groups and datasets) matching the criterion

Both approaches are O(n): you have to test the name of every dataset, but nothing more. If you need a lazy solution, the first option may be the better choice.
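
To tie this back to the question, the visitor can also apply a regular expression to the last path component and stop at the first match. This is a hedged sketch rather than the answerer's code: find_first_dataset is a hypothetical helper, the in-memory demo file and its layout are made up, and it relies on Group.visititems returning the first non-None value produced by the callback.

```python
import re
import h5py

def find_first_dataset(group, pattern):
    """Return the full path of the first dataset whose own name matches, else None."""
    regex = re.compile(pattern)

    def visitor(name, obj):
        # name is the path relative to `group`; test only its last component
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(name.split('/')[-1]):
            return obj.name  # any non-None return value stops the traversal
        return None          # None means "keep visiting"

    return group.visititems(visitor)

# Small in-memory demo file (never written to disk); the layout is made up
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('123/completed', data=[1, 2, 3])
    f.create_dataset('123/pending', data=[0])
    hit = find_first_dataset(f, r'completed')
    miss = find_first_dataset(f, r'missing')
```

Here hit holds the path of the matching dataset and miss is None, so a plain "is not None" test answers the existence question.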

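Answer 1 (score: 0)

Recursion to find all valid paths to datasets

The following code uses recursion to find the valid data paths of all datasets. Once I have the valid paths (with possible circular references terminated after 3 repeats), I can apply a regular expression to the returned list (not shown).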

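The output of the code below shows how "visititems" works or, for my purposes, fails to identify all valid paths, whereas the recursion meets my (and possibly your) needs.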

import numpy as np
import h5py
import collections
import warnings


def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))

    if len(group.name) > max_len_check:
        # This section terminates a likely circular reference once the current group name
        # occurs at least max_repeats times in the path at equal spacing. However, it will
        # incorrectly terminate a tree that really does repeat the same sequence of names.
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]

        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []

    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
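            # pass the limits through so non-default values also apply to subgroups
            dataset_list += visit_data_sets(value, max_len_check, max_repeats)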

        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))

    return dataset_list

def visit_callback(name, obj):
    # obj instead of object, to avoid shadowing the built-in
    print('Visiting name = "{}", object name = "{}"'.format(name, obj.name))
    return None

hdf_fptr = h5py.File('link_test.hdf5', mode='w')

group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')

# create a circular reference
group1ai = group1a['group1ai'] = group1


avect = np.arange(0,12.3, 1.0)

dset = group1.create_dataset('avect', data=avect)

group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)


print('\nThis demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)

print('\nThis demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)


print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)

for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))

hdf_fptr.close()

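The first "Data Path" result is the original dataset. The second and third are references to the original dataset through the circular reference. The fourth result is the hard link, and the fifth is the soft link to the original dataset.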
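
The regular-expression step that this answer leaves out could look roughly like the sketch below. It is an assumption about what was intended, not the answerer's code: filter_paths_by_name is a hypothetical helper, and the sample paths are only modeled on the names used above, standing in for the list returned by visit_data_sets.

```python
import re

def filter_paths_by_name(paths, pattern):
    """Keep the paths whose final component fully matches the regular expression."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p.rsplit('/', 1)[-1])]

# Sample paths standing in for the list returned by visit_data_sets
sample_paths = ['/junk/group1/avect', '/junk/group2/alias', '/junk/group3/alias3']
matches = filter_paths_by_name(sample_paths, r'alias\d*')
```

Matching only the final component means the full path never needs to be known in advance, which is what the question asks for.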