Question

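Is it possible to save a numpy array by appending it to an already existing npy file, something like np.save(filename, arr, mode='a')?

I have several functions that have to iterate over the rows of a large array. Because of memory constraints, I cannot create the array all at once. To avoid creating the rows over and over again, I want to create each row once and save it to a file, appending it to the previous row in that file. Later I could load the npy file in mmap_mode and access slices when needed.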

Answer 1

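The built-in .npy file format is perfectly fine for working with small datasets, without relying on external modules other than numpy.

However, once you start having large amounts of data, a file format designed to handle such datasets, such as HDF5, is preferable [1].

For instance, below is a solution for saving numpy arrays in HDF5 with PyTables.

Step 1: Create an extendable EArray storage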

import tables
import numpy as np

filename = 'outarray.h5'
ROW_SIZE = 100
NUM_COLUMNS = 200

f = tables.open_file(filename, mode='w')
atom = tables.Float64Atom()

array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))

for idx in range(NUM_COLUMNS):
    x = np.random.rand(1, ROW_SIZE)
    array_c.append(x)
f.close()

Step 2: Append rows to an existing dataset (if needed)

f = tables.open_file(filename, mode='a')
f.root.data.append(x)
f.close()  # flush the appended row to disk

Step 3: Read back a subset of the data

f = tables.open_file(filename, mode='r')
print(f.root.data[1:10,2:20]) # e.g. read from disk only this part of the dataset
f.close()

Answer 2

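To append data to an already existing file using numpy.save, we should use:

# file() exists only in Python 2; open() in binary append mode works in Python 2 and 3
f_handle = open(filename, 'ab')
numpy.save(f_handle, arr)
f_handle.close()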

I have checked that it works with Python 2.7 and numpy 1.10.4.

I adapted the code from here, which discusses the savetxt method.
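
One caveat is worth making concrete: a file written this way is a sequence of independent .npy records, each with its own header, so np.load on the file path returns only the first record. The following is a minimal sketch, assuming Python 3 and a temporary file (the path is hypothetical and only stands in for filename above):

```python
import os
import tempfile

import numpy as np

# Hypothetical temporary file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "appended.npy")

# Append two arrays, each written as its own .npy record (header + data).
for arr in (np.zeros(3), np.ones(3)):
    with open(path, "ab") as f_handle:
        np.save(f_handle, arr)

# Loading by path only sees the first record.
first_only = np.load(path)

# Reading from one open handle yields the records one after another.
with open(path, "rb") as f_handle:
    first = np.load(f_handle)
    second = np.load(f_handle)
```

So the appended data is not lost, but it has to be read back with one np.load call per saved array.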

Answer 3

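.npy files contain a header that holds the shape and dtype of the array. If you know what your resulting array looks like, you can write the header yourself and then write the data in chunks. For example, here is code for concatenating 2d matrices: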

import numpy as np
import numpy.lib.format as fmt

def get_header(fnames):
    dtype = None
    shape_0 = 0
    shape_1 = None
    for i, fname in enumerate(fnames):
        m = np.load(fname, mmap_mode='r') # mmap so we read only header really fast
        if i == 0:
            dtype = m.dtype
            shape_1 = m.shape[1]
        else:
            assert m.dtype == dtype
            assert m.shape[1] == shape_1
        shape_0 += m.shape[0]
    return {'descr': fmt.dtype_to_descr(dtype), 'fortran_order': False, 'shape': (shape_0, shape_1)}

def concatenate(res_fname, input_fnames):
    header = get_header(input_fnames)
    with open(res_fname, 'wb') as f:
        fmt.write_array_header_2_0(f, header)
        for fname in input_fnames:
            m = np.load(fname)
            f.write(m.tobytes('C'))  # tostring() is deprecated; tobytes() is the same call

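If you need a more general solution (editing the header in place while appending), you will have to resort to fseek tricks like the ones in [1].

Inspired by [1]: https://mail.scipy.org/pipermail/numpy-discussion/2009-August/044570.html (does not work out of the box)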
[2]：https://docs.scipy.org/doc/numpy/neps/npy-format.html
[3]：https://github.com/numpy/numpy/blob/master/numpy/lib/format.py
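
To see what such a header contains, it can be read back with the public helpers in numpy.lib.format. This is only a sketch: the temporary file and the variable names are made up for the illustration, and it assumes np.save produced a version 1.0 or 2.0 file, which is the case for a small array like this one.

```python
import os
import tempfile

import numpy as np
import numpy.lib.format as fmt

# Hypothetical temporary file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.zeros((3, 4), dtype=np.float32))

with open(path, "rb") as f:
    version = fmt.read_magic(f)  # e.g. (1, 0)
    if version == (1, 0):
        shape, fortran_order, dtype = fmt.read_array_header_1_0(f)
    else:
        shape, fortran_order, dtype = fmt.read_array_header_2_0(f)
    data_offset = f.tell()  # the raw array data starts here

# Everything after the header is the raw data: 3 * 4 float32 values.
expected_size = data_offset + 3 * 4 * dtype.itemsize
```

The data offset plus the size of the raw data equals the file size, which is why appending more rows only requires writing more bytes and updating the shape stored in the header.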

Answer 4

This is an expansion of Mohit Pandey's answer, showing a full save/load example. It was tested with Python 3.6 and Numpy 1.11.3.

from pathlib import Path
import numpy as np
import os

p = Path('temp.npy')
with p.open('ab') as f:
    np.save(f, np.zeros(2))
    np.save(f, np.ones(2))

with p.open('rb') as f:
    fsz = os.fstat(f.fileno()).st_size
    out = np.load(f)
    while f.tell() < fsz:
        out = np.vstack((out, np.load(f)))

out = array([[0., 0.], [1., 1.]])
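
Calling np.vstack inside the loop copies the accumulated result on every iteration. For files with many records, a possible variant is to collect the arrays in a list and stack them once at the end. The helper name load_all below is made up for this sketch, and it assumes the file contains at least one record and that all records have compatible shapes.

```python
import os
import tempfile

import numpy as np


def load_all(path):
    """Read every .npy record in the file and stack them once at the end."""
    chunks = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        while f.tell() < size:
            chunks.append(np.load(f))
    return np.vstack(chunks)


# Same data as in the example above, written to a hypothetical temporary file.
path = os.path.join(tempfile.mkdtemp(), "temp.npy")
with open(path, "ab") as f:
    np.save(f, np.zeros(2))
    np.save(f, np.ones(2))

out = load_all(path)
```

The result still has to fit in memory, so this only helps with the loading cost, not with the memory limit from the question.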

Answer 5

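I made a library for creating Numpy .npy files that are larger than the machine's main memory by appending along axis zero. The file can then be read with mmap_mode="r".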

https://pypi.org/project/npy-append-array

Installation:

pip install npy-append-array

Example:

from npy_append_array import NpyAppendArray
import numpy as np

arr1 = np.array([[1,2],[3,4]])
arr2 = np.array([[1,2],[3,4],[5,6]])

filename='out.npy'

# optional, .append will create file automatically if not exists
np.save(filename, arr1)

npaa = NpyAppendArray(filename)
npaa.append(arr2)
npaa.append(arr2)
npaa.append(arr2)

data = np.load(filename, mmap_mode="r")

print(data)
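
Independently of the library, the memory-mapped read on its own can be tried with plain numpy. The sketch below uses a small array and a hypothetical temporary file; only the slice that is actually accessed gets copied into memory.

```python
import os
import tempfile

import numpy as np

# Hypothetical temporary file holding a small 6 x 2 array.
path = os.path.join(tempfile.mkdtemp(), "out.npy")
np.save(path, np.arange(12).reshape(6, 2))

# mmap_mode="r" maps the file instead of reading it into memory.
data = np.load(path, mmap_mode="r")

# Copy only rows 2 and 3 into a regular in-memory array.
block = np.array(data[2:4])
```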

Answer 6

You can try something like reading the file and then adding the new data:

import numpy as np
import os.path

x = np.arange(10) #[0 1 2 3 4 5 6 7 8 9]

y = np.load("save.npy") if os.path.isfile("save.npy") else [] #get data if exist
np.save("save.npy",np.append(y,x)) #save the new

After 2 operations:

print(np.load("save.npy")) #[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
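
Note that np.append without an axis argument flattens its inputs, so the snippet above only keeps the layout of 1-D data. For appending rows to a 2-D array, a variant along these lines might work. The helper name append_rows is hypothetical, and like the original it loads the whole file on every call, so it does not get around the memory limit from the question.

```python
import os
import tempfile

import numpy as np


def append_rows(path, rows):
    """Append rows along axis 0, rewriting the whole file each time."""
    rows = np.atleast_2d(rows)
    if os.path.isfile(path):
        rows = np.concatenate((np.load(path), rows), axis=0)
    np.save(path, rows)


# Hypothetical temporary file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "save.npy")
append_rows(path, [[1, 2, 3], [4, 5, 6]])
append_rows(path, [7, 8, 9])

result = np.load(path)
```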

Answer 7

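The following is based on PaxRomana99's answer. It creates a class that can be used to save and load the arrays. Ideally, the header of the npy file should also be changed every time a new array is added, so that the shape description stays correct (see here for a description of the header).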

import numpy as np
import pickle

from pathlib import Path
import os


class npyAppendableFile():
    def __init__(self, fname, newfile=True):
        '''
        Creates a new instance of the appendable filetype
        If newfile is True, recreate the file even if already exists
        '''
        self.fname=Path(fname)
        if newfile:
            with open(self.fname, "wb") as fh:
                fh.close()
        
    def write(self, data):
        '''
        append a new array to the file
        note that this will not change the header
        '''
        with open(self.fname, "ab") as fh:
            np.save(fh, data)
            
    def load(self, axis=2):
        '''
        Load the whole file, returning all the arrays that were consecutively
        saved on top of each other
        axis defines how the arrays should be concatenated
        '''
        
        with open(self.fname, "rb") as fh:
            fsz = os.fstat(fh.fileno()).st_size
            out = np.load(fh)
            while fh.tell() < fsz:
                out = np.concatenate((out, np.load(fh)), axis=axis)
            
        return out
    
    
    def update_content(self):
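        '''
        Rewrite the file as one concatenated array with a single header,
        so that a regular np.load() returns all the appended data
        '''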
        content = self.load()
        with open(self.fname, "wb") as fh:
            np.save(fh, content)

    @property
    def _dtype(self):
        return self.load().dtype

    @property
    def _actual_shape(self):
        return self.load().shape
    
    @property
    def header(self):
        '''
        Reads the header of the npy file
        '''
        with open(self.fname, "rb") as fh:
            version = np.lib.format.read_magic(fh)
            shape, fortran, dtype = np.lib.format._read_array_header(fh, version)
        
        return version, {'descr': dtype,
                         'fortran_order' : fortran,
                         'shape' : shape}
                
        
      
arr_a = np.random.rand(5,40,10)
arr_b = np.random.rand(5,40,7)    
arr_c = np.random.rand(5,40,3)    

f = npyAppendableFile("testfile.npy", True)        

f.write(arr_a)
f.write(arr_b)
f.write(arr_c)

out = f.load()

print (f.header)
print (f._actual_shape)

# after update we can load with regular np.load()
f.update_content()


new_content = np.load('testfile.npy')
print (new_content.shape)

Saving a numpy array in append mode

7 answers: