Question

Is it possible to create a .npy file without first allocating the corresponding array in memory?

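I need to create and work with a large numpy array, one too large to be created in memory. Numpy supports memory mapping, but as far as I can see my options are either: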

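Create a memmapped file using numpy.memmap. This creates the file directly on disk without allocating memory, but it does not store metadata, so when I re-map the file later I need to know its dtype, shape, etc. In the following, notice that not specifying the shape results in the memmap being interpreted as a flat array: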
```
In [77]: x=memmap('/tmp/x', int, 'w+', shape=(3,3))


In [78]: x
Out[78]: 
memmap([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])


In [79]: y=memmap('/tmp/x', int, 'r')


In [80]: y
Out[80]: memmap([0, 0, 0, 0, 0, 0, 0, 0, 0])
```
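Create an array in memory and save it using numpy.save, after which it can be loaded in memmapped mode. This records the metadata together with the array data on disk, but requires memory to be allocated for the entire array at least once.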
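To make the two options concrete, here is a minimal sketch. It assumes NumPy is imported as `np`; the temporary directory and the file names `x.dat` and `x.npy` are hypothetical, and `int64` is chosen explicitly only so the example behaves the same across platforms.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()

# Option 1: a raw memmap stores only the bytes, so dtype and shape
# must be supplied again when the file is re-mapped.
raw_path = os.path.join(tmpdir, "x.dat")  # hypothetical file name
x = np.memmap(raw_path, dtype=np.int64, mode="w+", shape=(3, 3))
x[:] = np.arange(9).reshape(3, 3)
x.flush()
del x

y = np.memmap(raw_path, dtype=np.int64, mode="r", shape=(3, 3))
print(y.shape)  # (3, 3)

# Option 2: np.save writes a .npy header holding dtype and shape,
# but the array has to exist in memory first.
npy_path = os.path.join(tmpdir, "x.npy")  # hypothetical file name
np.save(npy_path, np.arange(9, dtype=np.int64).reshape(3, 3))
z = np.load(npy_path, mmap_mode="r")
print(z.shape, z.dtype)  # (3, 3) int64
```

With the raw memmap the caller has to carry the dtype and shape separately; with numpy.save they come back from the file header, at the cost of building the array in memory first.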

Answer 1

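I had the same question and was disappointed when I read Sven's reply. It seems numpy would be missing some crucial functionality if you could not keep a huge array in a file and work on small pieces of it at a time. Your case seems close to one of the use cases in the original rationale for creating the .npy format (see: http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt).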

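I then ran into numpy.lib.format, which seems to be full of useful goodies. I have no idea why this functionality is not exposed in the numpy root package. The key advantage over HDF5 is that it ships with numpy.

>>> print(numpy.lib.format.open_memmap.__doc__)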

"""
Open a .npy file as a memory-mapped array.

This may be used to read an existing file or create a new one.

Parameters
----------
filename : str
    The name of the file on disk. This may not be a filelike object.
mode : str, optional
    The mode to open the file with. In addition to the standard file modes,
    'c' is also accepted to mean "copy on write". See `numpy.memmap` for
    the available mode strings.
dtype : dtype, optional
    The data type of the array if we are creating a new file in "write"
    mode.
shape : tuple of int, optional
    The shape of the array if we are creating a new file in "write"
    mode.
fortran_order : bool, optional
    Whether the array should be Fortran-contiguous (True) or
    C-contiguous (False) if we are creating a new file in "write" mode.
version : tuple of int (major, minor)
    If the mode is a "write" mode, then this is the version of the file
    format used to create the file.

Returns
-------
marray : numpy.memmap
    The memory-mapped array.

Raises
------
ValueError
    If the data or the mode is invalid.
IOError
    If the file is not found or cannot be opened correctly.

See Also
--------
numpy.memmap
"""

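As a rough sketch of how open_memmap might be used for this, the example below creates a .npy file directly on disk, fills it in blocks, and reopens it with numpy.load in memory-mapped mode. The file name `big.npy`, the shape, and the chunk size are illustrative assumptions, kept small here so the example runs quickly.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

path = os.path.join(tempfile.mkdtemp(), "big.npy")  # hypothetical file name

# Create the .npy file on disk; the header (dtype, shape) is written up front.
shape = (1000, 100)  # small illustrative shape
out = open_memmap(path, mode="w+", dtype=np.float64, shape=shape)

# Fill the array one block of rows at a time; each row holds its own index.
chunk = 250  # illustrative chunk size
for start in range(0, shape[0], chunk):
    stop = min(start + chunk, shape[0])
    out[start:stop] = np.arange(start, stop, dtype=np.float64)[:, None]

out.flush()
del out

# Reopen later without restating dtype or shape.
arr = np.load(path, mmap_mode="r")
print(arr.shape, arr.dtype)
```

Because the header is stored in the file, the second half of the example needs only the path to recover the array's shape and dtype.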
Answer 2

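As you found out yourself, NumPy is mainly aimed at handling data in memory. There are different libraries for handling data on disk, the most commonly used today probably being HDF5. I suggest having a look at h5py, an excellent Python wrapper for the HDF5 library. It is designed to be used together with NumPy, and its interface is easy to learn if you already know NumPy. To get an impression of how it addresses your problem, read the documentation of Datasets.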
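For comparison, a minimal sketch of block-wise writing and reading with h5py, assuming the package is installed. The file name `big.h5`, the dataset name `data`, and the sizes are hypothetical choices for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.h5")  # hypothetical file name

# Create a dataset on disk and fill it one block of rows at a time.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("data", shape=(1000, 100), dtype="f8")
    for start in range(0, 1000, 250):
        dset[start:start + 250] = np.full((250, 100), float(start))

# Reopen the file; shape and dtype are stored with the dataset.
with h5py.File(path, "r") as f:
    dset = f["data"]
    shape, dtype = dset.shape, dset.dtype
    block = dset[250:260]  # reads only this slice into memory

print(shape, dtype, block.shape)
```

Slicing the dataset returns an ordinary NumPy array holding just the requested rows, so the full array never has to fit in memory.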

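For the sake of completeness, I should mention PyTables, which seems to be the "standard" way of handling large datasets in Python. I did not use it because h5py appealed to me more. Both libraries have FAQ entries that define their scope relative to the other one.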
