Question

我正在使用一堆大的numpy数组，并且由于这些数据最近开始咀嚼太多内存，我想用numpy.memmap实例替换它们。问题是，我现在必须调整阵列的大小，我最好这样做。这对于普通数组非常有效，但是在memmaps上尝试这样做会抱怨，数据可能会被共享，甚至禁用重新检查也无济于事。

a = np.arange(10)
a.resize(20)
a
>>> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

a = np.memmap('bla.bin', dtype=int)
a
>>> memmap([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

a.resize(20, refcheck=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-f1546111a7a1> in <module>()
----> 1 a.resize(20, refcheck=False)

ValueError: cannot resize this array: it does not own its data

调整底层mmap缓冲区的大小非常合适。问题是如何将这些更改反映到数组对象。我已经看到了这个workaround，但不幸的是它没有调整阵列的大小。还有一些numpy documentation关于调整mmaps的大小，但它显然不起作用，至少在版本1.8.0。还有其他想法，如何覆盖内置的大小调整检查？

Answer 1

问题是创建阵列时标志OWNDATA为False。您可以通过在创建数组时要求标志为True来更改它：

>>> a = np.require(np.memmap('bla.bin', dtype=int), requirements=['O'])
>>> a.shape
(10,)
>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> a.resize(20, refcheck=False)
>>> a.shape
(20,)

唯一需要注意的是，它可以创建数组并制作副本以确保满足要求。

编辑以解决保存问题：

如果要将重新调整大小的数组保存到磁盘，可以将memmap保存为.npy格式的文件，并在需要重新打开并用作memmap时以numpy.memmap打开：

>>> a[9] = 1
>>> np.save('bla.npy',a)
>>> b = np.lib.format.open_memmap('bla.npy', dtype=int, mode='r+')
>>> b
memmap([0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

编辑以提供其他方法：

您可以通过重新调整基本mmap（a.base或a._mmap，以uint8格式存储）并“重新加载”memmap来接近您所寻找的内容：

>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> a[3] = 7
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])
>>> a.flush()
>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])
>>> a.base.resize(20*8)
>>> a.flush()
>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Answer 2

如果我没有弄错的话，这实际上取决于@ wwwslinger的第二个解决方案，但无需手动指定新memmap的大小：

In [1]: a = np.memmap('bla.bin', mode='w+', dtype=int, shape=(10,))

In [2]: a[3] = 7

In [3]: a
Out[3]: memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])

In [4]: a.flush()

# this will append to the original file as much as is necessary to satisfy
# the new shape requirement, given the specified dtype
In [5]: new_a = np.memmap('bla.bin', mode='r+', dtype=int, shape=(20,))

In [6]: new_a
Out[6]: memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]: a[-1] = 10

In [8]: a
Out[8]: memmap([ 0,  0,  0,  7,  0,  0,  0,  0,  0, 10])

In [9]: a.flush()

In [11]: new_a
Out[11]: 
memmap([ 0,  0,  0,  7,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0])

当新数组需要比旧数组大时，这种方法很有效，但我不认为这种类型的方法会允许在新数组较小时自动截断内存映射文件的大小

在@ wwwslinger的答案中，手动调整基数大小似乎允许截断文件，但它不会减小数组的大小。

例如：

# this creates a memory mapped file of 10 * 8 = 80 bytes
In [1]: a = np.memmap('bla.bin', mode='w+', dtype=int, shape=(10,))

In [2]: a[:] = range(1, 11)

In [3]: a.flush()

In [4]: a
Out[4]: memmap([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

# now truncate the file to 40 bytes
In [5]: a.base.resize(5*8)

In [6]: a.flush()

# the array still has the same shape, but the truncated part is all zeros
In [7]: a
Out[7]: memmap([1, 2, 3, 4, 5, 0, 0, 0, 0, 0])

In [8]: b = np.memmap('bla.bin', mode='r+', dtype=int, shape=(5,))

# you still need to create a new np.memmap to change the size of the array
In [9]: b
Out[9]: memmap([1, 2, 3, 4, 5])

调整numpy.memmap数组的大小

2 个答案: