Question

Given a list of lists of strings, such as:

test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]

I would like to store it with h5py so that:

f['test_dataset'][0] = ['a1','a2']
f['test_dataset'][0][0] = 'a1'
etc.

Following the suggestions in the thread "H5py store list of list of strings", I tried the following:

import h5py
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
with h5py.File('test.h5','w') as f:
    string_dt = h5py.special_dtype(vlen=str)
    f.create_dataset('test_dataset',data=test_array,dtype=string_dt)

However, this results in each nested list being stored as a string, i.e.:

f['test_dataset'][0] = "['a1', 'a2']"
f['test_dataset'][0][0] = '['

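If this is not possible with h5py or any other HDF5-based library, I would be happy to hear suggestions for other formats or libraries I could use to store the data.

My data consists of multidimensional numpy integer arrays and nested lists of strings as in the example above, with roughly >100M rows and ~8 columns.

Thanks!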

Answer 1

In "Saving with h5py arrays of different sizes" I suggested saving a list of variable-length arrays as multiple datasets.

In [18]: import h5py, numpy as np
In [19]: f = h5py.File('test.h5','w')
In [20]: g = f.create_group('test_array')
In [21]: test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
In [22]: string_dt = h5py.special_dtype(vlen=str)
In [23]: for i,v in enumerate(test_array):
    ...:     g.create_dataset(str(i), data=np.array(v,'S4'), dtype=string_dt)
    ...:     
In [24]: for k in g.keys():
    ...:     print(k,g[k][:])
    ...:     
0 ['a1' 'a2']
1 ['b1']
2 ['c1' 'c2' 'c3' 'c4']

With many small sublists this could get messy, and I'm not sure how efficient it is.
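A different layout, not taken from the answer above, keeps the number of datasets fixed at two no matter how many sublists there are: one flat sequence holding every string, plus an integer index marking where each sublist starts. The sketch below shows only the bookkeeping in plain Python; the assumption is that `flat` and `offsets` would each be written as a single dataset. The names `flat`, `offsets` and `get_row` are illustrative.

```python
from itertools import accumulate

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# All strings in one flat list, in row order.
flat = [s for row in test_array for s in row]

# offsets[i] is where row i starts in flat; offsets[i + 1] is where it ends.
offsets = [0] + list(accumulate(len(row) for row in test_array))


def get_row(i):
    """Return row i by slicing the flat list (hypothetical helper)."""
    return flat[offsets[i]:offsets[i + 1]]


print(offsets)      # [0, 2, 3, 7]
print(get_row(2))   # ['c1', 'c2', 'c3', 'c4']
```

Empty sublists are handled naturally here, since two equal consecutive offsets yield an empty slice.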

'Flattening' with list joins might work:

In [27]: list1 =[', '.join(x) for x in test_array]
In [28]: list1
Out[28]: ['a1, a2', 'b1', 'c1, c2, c3, c4']
In [30]: '\n'.join(list1)
Out[30]: 'a1, a2\nb1\nc1, c2, c3, c4'

The nested list can be recreated with a few splits.
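A minimal sketch of that round trip, using the same separators as the joins above. It assumes no item contains `', '` or a newline and that no sublist is empty (an empty row would come back as `['']`).

```python
test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Flatten: join items with ', ' and rows with '\n'.
flat = '\n'.join(', '.join(row) for row in test_array)

# Rebuild: split rows on '\n', then items on ', '.
restored = [line.split(', ') for line in flat.split('\n')]

print(restored == test_array)
```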

Another idea: pickle to a string and save that.

From the h5py intro:

An HDF5 file is a container for two kinds of objects: datasets, which
are array-like collections of data, and groups, which are folder-like
containers that hold datasets and other groups. The most fundamental
thing to remember when using h5py is:

Groups work like dictionaries, and datasets work like NumPy arrays

pickle doesn't work:

In [32]: import pickle
In [33]: pickle.dumps(test_array)
Out[33]: b'\x80\x03]q\x00(]q\x01(X\x02\x00\x00\x00a1q\x02X\x02\x00\x00\x00a2q\x03e]q\x04X\x02\x00\x00\x00b1q\x05a]q\x06(X\x02\x00\x00\x00c1q\x07X\x02\x00\x00\x00c2q\x08X\x02\x00\x00\x00c3q\tX\x02\x00\x00\x00c4q\nee.'
In [34]: f.create_dataset('pickled', data=pickle.dumps(test_array), dtype=string
    ...: _dt)
....
ValueError: VLEN strings do not support embedded NULLs
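One possible way around the NUL restriction, which was not tried in this answer, is to base64-encode the pickle so the payload becomes plain ASCII text. The sketch below shows only the encoding round trip with the standard library; the assumption is that the resulting string could then be stored like any other variable-length string.

```python
import base64
import pickle

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Raw pickle output can contain NUL bytes, which VLEN strings reject.
raw = pickle.dumps(test_array)

# base64 maps arbitrary bytes onto a NUL-free ASCII alphabet.
encoded = base64.b64encode(raw).decode('ascii')

# To read back: decode base64, then unpickle.
restored = pickle.loads(base64.b64decode(encoded))

print(restored == test_array)
```

The encoded text is about a third larger than the raw pickle, and the usual caution about unpickling untrusted data applies.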

JSON

In [35]: import json
In [36]: json.dumps(test_array)
Out[36]: '[["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]]'
In [37]: f.create_dataset('pickled', data=json.dumps(test_array), dtype=string_d
    ...: t)
Out[37]: <HDF5 dataset "pickled": shape (), type "|O">
In [43]: json.loads(f['pickled'][()])   # .value was removed in h5py 3.0
Out[43]: [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
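The session above stores the whole nested list as one JSON string. To get closer to the per-row access asked for in the question, each sublist could be encoded separately, giving one string per row. A sketch of just the encoding and decoding steps follows; it assumes the list `rows` would be written as a 1-D variable-length string dataset, which is not shown here.

```python
import json

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# One JSON string per sublist, so each row can be decoded on its own.
rows = [json.dumps(row) for row in test_array]

# Decoding a single row recovers the original sublist.
first = json.loads(rows[0])

print(first)      # ['a1', 'a2']
print(first[0])   # a1
```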

Answer 2

An ugly workaround:

hf.create_dataset('test', data=repr(test_array))
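The answer does not show how to read the data back. One way, sketched here without the file I/O, is `ast.literal_eval`, which parses the `repr` text as a Python literal without executing arbitrary code.

```python
import ast

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# The text that repr() produces, i.e. what would be written to the file.
text = repr(test_array)

# Parse the literal back into a nested list.
restored = ast.literal_eval(text)

print(restored == test_array)
```

Depending on the h5py version, the stored value may come back as `bytes` rather than `str`, in which case it would need `.decode()` before being passed to `ast.literal_eval`.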
