Given a list of lists of strings, for example:
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
I would like to store it with h5py so that:
f['test_dataset'][0] = ['a1','a2']
f['test_dataset'][0][0] = 'a1'
etc.
Following the advice in the thread "H5py store list of list of strings", I tried the following:
import h5py
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
with h5py.File('test.h5','w') as f:
    string_dt = h5py.special_dtype(vlen=str)
    f.create_dataset('test_dataset',data=test_array,dtype=string_dt)
However, this results in each nested list being stored as a single string, i.e.:
f['test_dataset'][0] = "['a1', 'a2']"
f['test_dataset'][0][0] = '['
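If this is not possible with h5py or any other HDF5-based library, I would be happy to hear suggestions for other formats or libraries that could store this data.
My data consists of multi-dimensional numpy integer arrays and nested lists of strings like the example above, with roughly > 100M rows and ~8 columns.
Thanks!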
Answer 0 (score: 1)
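In "Saving with h5py arrays of different sizes" I suggested saving a list of variable-length arrays as multiple datasets. (The session below assumes h5py has been imported and numpy has been imported as np.)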
In [19]: f = h5py.File('test.h5','w')
In [20]: g = f.create_group('test_array')
In [21]: test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
In [22]: string_dt = h5py.special_dtype(vlen=str)
In [23]: for i,v in enumerate(test_array):
    ...:     g.create_dataset(str(i), data=np.array(v,'S4'), dtype=string_dt)
    ...:
In [24]: for k in g.keys():
    ...:     print(k,g[k][:])
    ...:
0 ['a1' 'a2']
1 ['b1']
2 ['c1' 'c2' 'c3' 'c4']
With many small lists this could get messy, and I'm not sure how efficient it is.
'Flattening' the data with string joins might work:
In [27]: list1 =[', '.join(x) for x in test_array]
In [28]: list1
Out[28]: ['a1, a2', 'b1', 'c1, c2, c3, c4']
In [30]: '\n'.join(list1)
Out[30]: 'a1, a2\nb1\nc1, c2, c3, c4'
The nested list can be recreated with a couple of split calls.
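The round trip can be sketched as follows. This is a minimal illustration that assumes no string contains the separators ', ' or '\n'; the names flat and rebuilt are made up for the example.

```python
test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Flatten: join the items of each row with ', ' and the rows with '\n'.
flat = '\n'.join(', '.join(row) for row in test_array)

# Rebuild: split on the row separator first, then on the item separator.
rebuilt = [line.split(', ') for line in flat.split('\n')]

print(rebuilt == test_array)
```

Note that an empty inner list would not survive this round trip, since ''.split(', ') returns [''] rather than [].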
Another idea: pickle the list to a string and save that.
From the h5py intro:
An HDF5 file is a container for two kinds of objects: datasets, which
are array-like collections of data, and groups, which are folder-like
containers that hold datasets and other groups. The most fundamental
thing to remember when using h5py is:
Groups work like dictionaries, and datasets work like NumPy arrays
pickle doesn't work:
In [32]: import pickle
In [33]: pickle.dumps(test_array)
Out[33]: b'\x80\x03]q\x00(]q\x01(X\x02\x00\x00\x00a1q\x02X\x02\x00\x00\x00a2q\x03e]q\x04X\x02\x00\x00\x00b1q\x05a]q\x06(X\x02\x00\x00\x00c1q\x07X\x02\x00\x00\x00c2q\x08X\x02\x00\x00\x00c3q\tX\x02\x00\x00\x00c4q\nee.'
In [34]: f.create_dataset('pickled', data=pickle.dumps(test_array), dtype=string_dt)
....
ValueError: VLEN strings do not support embedded NULLs
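The error comes from the pickle output being binary data that can contain NUL bytes. One possible workaround, not tried in this answer, is to base64-encode the pickle so that the result is plain ASCII text. This sketch only shows the encoding step with the standard library; storing encoded in a variable-length string dataset is assumed to work like any other ASCII string.

```python
import base64
import pickle

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# pickle.dumps returns bytes, which may include embedded NUL bytes.
raw = pickle.dumps(test_array)

# base64 maps arbitrary bytes onto a NUL-free ASCII alphabet.
encoded = base64.b64encode(raw).decode('ascii')

# Reverse both steps to get the original nested list back.
restored = pickle.loads(base64.b64decode(encoded))

print(restored == test_array)
```

The cost is roughly a one-third increase in size, and the stored value is only readable from Python.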
JSON, on the other hand, works:
In [35]: import json
In [36]: json.dumps(test_array)
Out[36]: '[["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]]'
In [37]: f.create_dataset('pickled', data=json.dumps(test_array), dtype=string_dt)
Out[37]: <HDF5 dataset "pickled": shape (), type "|O">
In [43]: json.loads(f['pickled'][()])
Out[43]: [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
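The session above stores the whole structure as one scalar string, so a single row cannot be read on its own. A variation, sketched here under the assumption that per-row access matters, is to JSON-encode each inner list separately. The result is a flat list of strings, which is the kind of input a one-dimensional variable-length string dataset is expected to accept; the h5py call itself is left out of the sketch.

```python
import json

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# One JSON string per row; the outer list stays a plain 1-D sequence.
rows = [json.dumps(row) for row in test_array]

# Decoding a single row gives back that row's list of strings.
first_row = json.loads(rows[0])
first_item = first_row[0]

print(first_row, first_item)
```

With this layout, json.loads(f['test_dataset'][0])[0] would play the role of the f['test_dataset'][0][0] lookup asked for in the question.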
Answer 1 (score: 0)
An ugly workaround:
hf.create_dataset('test', data=repr(test_array))
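Reading the data back then means parsing the repr text again. A minimal sketch of that step, assuming the stored value is available as a Python str (depending on the h5py version it may come back as bytes and need decoding first), uses ast.literal_eval, which only evaluates literals and is safer than eval:

```python
import ast

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# This is the text that the workaround above would store.
text = repr(test_array)

# literal_eval parses Python literals (lists, strings, numbers) without running code.
restored = ast.literal_eval(text)

print(restored == test_array)
```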