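How can I use a regular expression to check whether a dataset exists, without first reading the paths of all datasets?
For example, I want to check whether the dataset 'completed' exists in a file that may (or may not) contain:
/123/completed
(Assume I don't know the full path; I only want to check the dataset name. So this answer does not work in my case.)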
Answer 0 (score: 1)
No regular expression is needed. You can build a set of dataset names by recursively traversing the groups in the HDF5 file:
import h5py

def traverse_datasets(hdf_file):
    """Traverse all datasets across all groups in HDF5 file."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset):  # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group):  # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]

all_datasets = set(traverse_datasets('file.h5'))
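Then simply test membership: 'completed' in all_datasets.
Alternatively, you can use Group.visit. Note that your search function must return None in order to iterate over all groups; any other return value stops the traversal.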
res = []

def searcher(name, k='completed'):
    """ Find all objects with k anywhere in the name """
    if k in name:
        res.append(name)
    return None  # returning None keeps the traversal going

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # print names of all objects (groups and datasets) matching the criterion
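Both approaches are O(n) in the number of objects in the file: you have to test the name of each dataset, but nothing more. If you need a lazy solution, the first option may be the better one, because it is a generator.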
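Since the question asks for a regular expression and for stopping early, here is a minimal sketch that combines the two ideas with Group.visititems, which stops as soon as the callback returns something other than None. The function name dataset_exists, its parameters, and the demo file are my own assumptions for illustration, not part of h5py or of the answer above.

```python
import re
import h5py

def dataset_exists(h5_group, pattern):
    """Return the path of the first dataset whose own name fully matches
    `pattern` (hypothetical helper), or None if there is no such dataset."""
    regex = re.compile(pattern)

    def callback(name, obj):
        # `name` is the path relative to h5_group, e.g. '123/completed'
        leaf = name.rsplit('/', 1)[-1]
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(leaf):
            return name  # any non-None return value stops the traversal
        return None      # keep visiting

    return h5_group.visititems(callback)

# Demo on a throwaway in-memory file (nothing is written to disk).
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('123/completed', data=[1, 2, 3])
    f.create_dataset('123/pending', data=[4])
    found = dataset_exists(f, r'completed')
    missing = dataset_exists(f, r'finished')
```

The worst case is still O(n), since a file without a match has to be visited completely. As far as I know, visititems also does not follow soft links, so a dataset that is reachable only through a soft link would not be reported by this sketch.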
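Answer 1 (score: 0)
The following code uses recursion to find the valid data paths of all datasets. Once I have the valid paths (likely circular references are terminated after 3 repeats), I can apply a regular expression to the returned list (not shown).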
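The output of the code below shows how "visititems" works or, for my purposes, fails to identify all the valid paths, whereas the recursion meets my (and possibly your) needs.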
import numpy as np
import h5py
import collections
import warnings


def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))
    if len(group.name) > max_len_check:
        # This section terminates a likely circular reference once the current group
        # name appears at least max_repeats times in the path and its most recent
        # occurrences are evenly spaced. However, it will incorrectly terminate a tree
        # if identical repetitive sequences of names are actually used in the tree.
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]
        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []
    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
            # pass the limits down so non-default values also apply to subgroups
            dataset_list += visit_data_sets(value, max_len_check, max_repeats)
        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))
    return dataset_list

def visit_callback(name, object):
    print('Visiting name = "{}", object name = "{}"'.format(name, object.name))
    return None

hdf_fptr = h5py.File('link_test.hdf5', mode='w')
group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')
# create a circular reference
group1ai = group1a['group1ai'] = group1
avect = np.arange(0,12.3, 1.0)
dset = group1.create_dataset('avect', data=avect)
group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)
print('\nThis demonstrates "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)
print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)
for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))
hdf_fptr.close()
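The first "Data Path" result is the original dataset. The second and third are references to the original dataset through the circular reference. The fourth result is the hard link, and the fifth is the soft link to the original dataset.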
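The regular-expression step that this answer leaves out could look roughly like the sketch below. It assumes data_paths is a list of path strings in the shape returned by visit_data_sets; the helper name matching_datasets and the example paths are hypothetical and only serve as an illustration.

```python
import re

def matching_datasets(paths, pattern):
    """Return the paths whose last component (the dataset name)
    fully matches `pattern`. Hypothetical helper for illustration."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p.rsplit('/', 1)[-1])]

# Hypothetical example paths, shaped like the list from visit_data_sets().
data_paths = ['/junk/group1/avect', '/junk/group2/alias', '/junk/group3/alias3']

hits = matching_datasets(data_paths, r'alias\d*')
has_avect = bool(matching_datasets(data_paths, r'avect'))
```

Matching only the last path component with fullmatch checks the dataset name itself, so a group that happens to contain the same text somewhere in its path does not produce a false positive.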