Question

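I have many arrays stored in NumPy npz files, which were saved using the savez_compressed function.

I split the information across many arrays because otherwise the functions I use crash due to memory issues. The data is not sparse.

I need to join all the information into one single array (to be able to process it with some routines) and store it on disk (to process it many times with different parameters).

The arrays do not fit into RAM + swap memory.

How can I merge them into one single array and save it to disk?

I suspect I should use mmap_mode, but I do not know exactly how. Also, I imagine there could be some performance issues if I do not reserve contiguous disk space up front.

I have read this post, but I still cannot figure out how to do it.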

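EDIT

Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by using slices. Even so, it is still important to have all the information in one such array.

The reason is the following: the arrays contain information (from physical simulations) ordered in time. Among the function arguments, the user can set the initial and final times to process. He or she can also set the size of the processing chunk (this matters because it affects performance, and the feasible chunk size depends on the available computational resources). Because of this, I cannot store the data as separate chunks.

The way in which this particular array (the one I am trying to create) is built is not important.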

Answer 1

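Here is an example of how to write 90 GB of easily compressible data to disk. The most important points are mentioned here: https://stackoverflow.com/a/48405220/4045774

On a normal hard drive, the write/read speed should be in the range of 300 MB/s to 500 MB/s.

Example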

import numpy as np
import tables  # registers the Blosc filter
import h5py as h5
import h5py_cache as h5c
import time

def read_the_arrays():
    # Easily compressible data.
    # A lot smaller than your actual array; I do not have that much RAM.
    return np.arange(10*int(15E3)).reshape(10, int(15E3))

def writing(hdf5_path):
    # As we are writing whole chunks here this isn't really needed.
    # If you forget to set a large enough chunk-cache-size when not writing or reading
    # whole chunks, the performance will be extremely bad
    # (chunks can only be read or written as a whole).
    f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
    dset = f.create_dataset("your_data", shape=(int(15E5), int(15E3)), dtype=np.float32,
                            chunks=(10000, 100), compression=32001,
                            compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

    # Let's write to the dataset
    for i in range(0, int(15E5), 10):
        dset[i:i+10, :] = read_the_arrays()

    f.close()

def reading(hdf5_path):
    f = h5c.File(hdf5_path, 'r', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
    dset = f["your_data"]

    # Read chunks
    for i in range(0, int(15E3), 10):
        data = np.copy(dset[:, i:i+10])
    f.close()

hdf5_path = 'Test.h5'
t1 = time.time()
writing(hdf5_path)
print(time.time()-t1)
t1 = time.time()
reading(hdf5_path)
print(time.time()-t1)
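To see why the chunk cache size matters in the example above, it can help to work out how many chunks a single write or read touches. The following is only a back-of-the-envelope sketch using the shape, chunk shape, dtype and cache size from the example. It assumes, as the code comment states, that chunks are always read or written as a whole, and it ignores compression.

```python
import numpy as np

# Values copied from the example above.
shape = (int(15E5), int(15E3))        # dataset shape (rows, cols)
chunks = (10000, 100)                 # HDF5 chunk shape
itemsize = np.dtype(np.float32).itemsize
cache_bytes = 1024**2 * 1000          # chunk_cache_mem_size

total_bytes = shape[0] * shape[1] * itemsize    # uncompressed dataset size
chunk_bytes = chunks[0] * chunks[1] * itemsize  # uncompressed size of one chunk

# Writing dset[i:i+10, :] touches every chunk along the column axis.
chunks_per_write = shape[1] // chunks[1]
# Reading dset[:, i:i+10] touches every chunk along the row axis.
chunks_per_read = shape[0] // chunks[0]

print("dataset size (GB):", total_bytes / 1e9)
print("chunk size (MB):", chunk_bytes / 1e6)
print("bytes touched per write (MB):", chunks_per_write * chunk_bytes / 1e6)
print("bytes touched per read (MB):", chunks_per_read * chunk_bytes / 1e6)
print("fits in cache:", max(chunks_per_write, chunks_per_read) * chunk_bytes <= cache_bytes)
```

With these numbers, each partial write or read spans a band of chunks that still fits into the configured cache, so a chunk can stay in memory until the loop has finished with it instead of being recompressed and rewritten on every iteration.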

Answer 2

You should be able to load the data chunk by chunk into an np.memmap array:

import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)
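A file written through np.memmap is a raw buffer without a header, so reading it back requires passing the same dtype and shape again. One possible variant, sketched here on tiny synthetic data, is np.lib.format.open_memmap, which writes a regular .npy file that np.load can later open with mmap_mode='r' (the option mentioned in the question). The temporary directory, the file names, the array sizes and the 'array' key are assumptions made for illustration only.

```python
import os
import tempfile
import numpy as np

tmp_dir = tempfile.mkdtemp()

# Hypothetical stand-ins for the real .npz files: three small chunks,
# each stored under the key 'array' as in the answer above.
data_files = []
for k in range(3):
    path = os.path.join(tmp_dir, f"part{k}.npz")
    np.savez_compressed(path, array=np.full((4, 5), k, dtype=np.float32))
    data_files.append(path)

# First pass: determine the final shape and dtype.
rows, cols, dtype = 0, None, None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols, dtype = chunk.shape[1], chunk.dtype

# Second pass: write the chunks into a memory-mapped .npy file.
merged_path = os.path.join(tmp_dir, "merged.npy")
merged = np.lib.format.open_memmap(merged_path, mode='w+',
                                   dtype=dtype, shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)
merged.flush()
del merged  # release the write mapping

# Reopen lazily: shape and dtype come from the .npy header.
reopened = np.load(merged_path, mmap_mode='r')
print(reopened.shape, reopened.dtype)
```

Only one chunk at a time is held in RAM during the merge, and the reopened array can be passed to functions that expect a single array and sliced as usual.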

However, as pointed out in the comments, working across a dimension that is not the fastest one will be painfully slow.
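The reason is that the merged array is stored in C order: consecutive rows are adjacent on disk, whereas a single column is scattered across the whole file. A minimal sketch of processing a time window in row blocks follows, assuming the first axis is time. The helper iter_row_blocks, its parameters, the file location and the small array size are hypothetical and only serve the illustration.

```python
import os
import tempfile
import numpy as np

def iter_row_blocks(arr, start, stop, block_rows):
    """Yield consecutive row blocks of arr covering rows [start, stop)."""
    for i in range(start, stop, block_rows):
        yield arr[i:min(i + block_rows, stop)]

# Small stand-in for the merged buffer; each row holds its own row index.
path = os.path.join(tempfile.mkdtemp(), "merged.buffer")
shape = (1000, 8)
merged = np.memmap(path, dtype=np.float32, mode='w+', shape=shape)
merged[:] = np.arange(shape[0], dtype=np.float32)[:, None]
merged.flush()
del merged

# A raw memmap has no header, so dtype and shape must be given again.
data = np.memmap(path, dtype=np.float32, mode='r', shape=shape)

# Column sums over the window of rows 100..899, read 256 rows at a time.
total = np.zeros(shape[1], dtype=np.float64)
for block in iter_row_blocks(data, 100, 900, block_rows=256):
    total += block.sum(axis=0, dtype=np.float64)
print(total)

# Row blocks are contiguous in the file, single columns are not.
print(data[100:356].flags['C_CONTIGUOUS'])
print(data[:, 0].flags['C_CONTIGUOUS'])
```

The block size plays the role of the user-selectable processing chunk from the question: it bounds how much of the file is pulled into memory at once without changing how the data is stored.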

How to merge very large NumPy arrays?
