I am saving a numpy array that contains repetitive data:
import gzip
import pickle as pkl  # the original used cPickle (Python 2)
import numpy as np
import h5py

a = np.random.randn(100000, 10)
# Each row of b holds 10 consecutive rows of a, so neighbouring rows overlap heavily.
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

# Gzipped pickle
with gzip.open('noise.pkl.gz', 'wb') as f_pkl_gz:
    pkl.dump(b, f_pkl_gz, protocol=pkl.HIGHEST_PROTOCOL)

# Raw pickle (pickle data must be written in binary mode)
with open('noise.pkl', 'wb') as f_pkl:
    pkl.dump(b, f_pkl, protocol=pkl.HIGHEST_PROTOCOL)

# HDF5 with maximum gzip compression and the default chunk layout
with h5py.File('noise.hdf5', 'w') as f_hdf5:
    f_hdf5.create_dataset('b', data=b, compression='gzip', compression_opts=9)
Listing the resulting files:
-rw-rw-r--. 1 alex alex 76962165 Oct 7 20:51 noise.hdf5
-rw-rw-r--. 1 alex alex 79992937 Oct 7 20:51 noise.pkl
-rw-rw-r--. 1 alex alex 8330136 Oct 7 20:51 noise.pkl.gz
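So the HDF5 file with the highest compression level takes about as much space as the raw pickle, and more than 9 times as much as the gzipped pickle.
Does anyone know why this happens? What can I do about it?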
Answer 0 (score: 5)
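The answer is to use chunks, as suggested by @tcaswell. My guess is that compression is performed on each chunk separately and that the default chunk size is small, so a single chunk does not contain enough redundancy for the compressor to exploit.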
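The effect can be reproduced without HDF5 at all. The sketch below is only an illustration under stated assumptions: it uses `zlib` (the DEFLATE implementation behind gzip) directly, random bytes stand in for the random floats, and `N_ROWS`, `ROW_BYTES` and `compressed_size` are hypothetical names chosen here, not part of the original code. It compresses the same overlapping rows in independent chunks of different sizes and adds up the compressed sizes.

```python
import os
import zlib

ROW_BYTES = 80   # one row of `a`: 10 float64 values (assumed layout)
N_ROWS = 2000    # hypothetical, kept small so the sketch runs quickly
WINDOW = 10      # number of consecutive rows stacked side by side

# Random bytes stand in for random floats: incompressible on their own.
a_rows = [os.urandom(ROW_BYTES) for _ in range(N_ROWS)]

# Row i of `b` is rows i..i+9 of `a` concatenated, mirroring the hstack in the question.
b_rows = [b"".join(a_rows[i:i + WINDOW]) for i in range(N_ROWS - WINDOW + 1)]


def compressed_size(rows, rows_per_chunk):
    """Compress each chunk of rows independently and return the total size in bytes."""
    total = 0
    for start in range(0, len(rows), rows_per_chunk):
        chunk = b"".join(rows[start:start + rows_per_chunk])
        total += len(zlib.compress(chunk, 9))
    return total


raw_size = sum(len(row) for row in b_rows)
sizes = {n: compressed_size(b_rows, n) for n in (1, 10, 100, 1000)}
for n, size in sizes.items():
    print("rows per chunk: %4d  compressed: %8d  ratio: %.3f" % (n, size, size / raw_size))
```

With one row per chunk, each chunk holds only fresh random bytes, so nothing compresses. Once a chunk spans several rows, each row repeats most of the previous one and the compressor can replace the repeats with back-references.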
Here is h5py code that illustrates the idea:
import numpy as np
import h5py

a = np.random.randn(100000, 10)
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

# Write the same array with gzip level 9, varying only the number of rows per chunk.
for rows in (1, 10, 100, 1000, 10000):
    with h5py.File('noise_chunk_%d.hdf5' % rows, 'w') as f_hdf5:
        f_hdf5.create_dataset('b', data=b, compression='gzip',
                              compression_opts=9, chunks=(rows, 100))
Results:
-rw-rw-r--. 1 alex alex 8341134 Oct 7 21:53 noise_chunk_10000.hdf5
-rw-rw-r--. 1 alex alex 8416441 Oct 7 21:53 noise_chunk_1000.hdf5
-rw-rw-r--. 1 alex alex 9096936 Oct 7 21:53 noise_chunk_100.hdf5
-rw-rw-r--. 1 alex alex 16304949 Oct 7 21:53 noise_chunk_10.hdf5
-rw-rw-r--. 1 alex alex 85770613 Oct 7 21:53 noise_chunk_1.hdf5
So the file size grows as the chunks get smaller.
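To put the chunk shapes above in perspective, it helps to work out how many bytes each chunk holds and how many chunks the dataset is split into. The following is a small arithmetic sketch, assuming `b` has shape (99991, 100) with 8-byte float64 elements, which matches the construction in the question; `chunk_stats` is a hypothetical helper, not an h5py function.

```python
import math

SHAPE = (99991, 100)  # shape of b: 100000 - 10 + 1 rows, 10 * 10 columns
ITEMSIZE = 8          # bytes per float64 element


def chunk_stats(chunk_shape, shape=SHAPE, itemsize=ITEMSIZE):
    """Return (bytes per chunk, number of chunks) for a chunked dataset."""
    chunk_bytes = itemsize * math.prod(chunk_shape)
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
    return chunk_bytes, n_chunks


for rows in (1, 10, 100, 1000, 10000):
    chunk_bytes, n_chunks = chunk_stats((rows, 100))
    print("chunks=(%d, 100): %d bytes per chunk, %d chunks" % (rows, chunk_bytes, n_chunks))
```

With chunks=(1, 100) the compressor only ever sees 800 bytes at a time, spread over 99991 separate chunks, which fits the observation that this file ends up larger than the raw pickle. Larger chunks are not free, though: reading any element means decompressing the whole chunk that contains it, so the best chunk size depends on how the data will be accessed.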