Question

Consider this example:

import numpy as np
a = np.array(1)
np.save("a.npy", a)

a = np.load("a.npy", mmap_mode='r')
print(type(a))

b = a + 2
print(type(b))

Output:

<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>

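So it seems that b is no longer a memmap, and I assume this forces numpy to read the whole of a.npy, defeating the purpose of the memmap. Hence my question: can operations on memmaps be deferred until access time?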

I believe subclassing ndarray or memmap could work, but I am not confident enough in my Python skills to try it.

Here is an extended example showing my problem:

import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print the first value using f and memmaps


def f(value):
    print(value[0])


# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)

Answer 1

This is just how Python works. By default, numpy operations return a new array, so b never exists as a memmap; it is created when + is called on a.
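One practical consequence: if only part of the data is needed, indexing the memmap before applying the operation keeps the work limited to that part. The following is a minimal sketch of that idea, assuming a small temporary file (a hypothetical path) in place of the large file from the question.

```python
import os
import tempfile

import numpy as np

# Hypothetical small file standing in for the large .npy file.
path = os.path.join(tempfile.mkdtemp(), "a.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

a = np.load(path, mmap_mode="r")

# Slicing a memmap gives another memmap; no arithmetic has happened yet.
head = a[:5]

# The addition only touches the five selected elements and produces
# a new array of length 5 that is held in memory.
head_plus_one = head + 1

print(type(head).__name__)
print(head_plus_one)
```

This does not defer anything automatically; it only works when the caller can choose the index before the arithmetic.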

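There are a couple of ways to work around this. The simplest is to do all operations in place:

a += 1

This requires loading the memory-mapped array for both reading and writing:

a = np.load("a.npy", mmap_mode='r+')

Of course, this is no good if you don't want to overwrite your original array. In that case you need to specify that b should be memory-mapped:

b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)

The assignment can then be done with the out keyword provided by numpy ufuncs, e.g. np.add(a, 2, out=b).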
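A minimal sketch of the out approach, assuming small temporary files (hypothetical paths) instead of a.npy and b.npy. It uses numpy.lib.format.open_memmap, which the answer does not mention, so that the result is written as a valid .npy file; np.memmap on its own writes raw data without the .npy header.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "a.npy")  # hypothetical file names
dst_path = os.path.join(tmpdir, "b.npy")

np.save(src_path, np.arange(10, dtype=np.int64))

# Read-only memory map of the input.
a = np.load(src_path, mmap_mode="r")

# Writable memory map for the result, created as a real .npy file.
b = open_memmap(dst_path, mode="w+", dtype=a.dtype, shape=a.shape)

# The ufunc writes into b instead of allocating a separate result array.
np.add(a, 2, out=b)
b.flush()

# The result can be loaded again like any other .npy file.
reloaded = np.load(dst_path)
print(reloaded)
```

Note that the whole input is still processed here; the benefit is that the result lives in a file-backed array rather than in a second in-memory array.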

Answer 2

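Here is a simple example of an ndarray subclass that defers operations applied to it until a specific element is requested by indexing.
I include it to show that it can be done, but it will almost certainly fail in novel and unexpected ways, and it would need substantial work to become usable. In a very specific case it may be easier than redesigning your code to solve the problem in a better way. I recommend reading these examples in the documentation to help understand how it works.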

import numpy as np


class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    calculating them when an index is requested
    """

    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        # List of (ufunc, method, inputs, kwargs) tuples to apply later
        arr.toApply = []
        return arr

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Convert all arguments to ndarray, otherwise arguments
        # of type Defered will cause infinite recursion;
        # also store self as None, to be replaced later on
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)

        # Store function to apply and necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self

    def __getitem__(self, idx):
        # Get index and convert to regular array
        sub = self.view(np.ndarray).__getitem__(idx)

        # Apply stored actions, substituting the indexed data for self
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

        return sub

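This will fail if the array is modified by anything other than numpy's universal functions. For example, percentile and median are not based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or apply an index that selects a large part of it, the whole array will be loaded.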
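For comparison, here is a smaller sketch that avoids subclassing ndarray altogether. LazyArray is a hypothetical wrapper class, not part of numpy and not the answer's method: it queues element-wise functions and runs them only on the part that is indexed. It assumes that every queued function is element-wise and that the other operand is a scalar, so that applying the function to a slice gives the same values as slicing the full result.

```python
import numpy as np


class LazyArray:
    """Hypothetical wrapper: queue element-wise functions, apply on indexing."""

    def __init__(self, base, pending=()):
        self._base = base                # e.g. a read-only np.memmap
        self._pending = tuple(pending)   # functions to apply, in order

    def apply(self, func):
        # Return a new wrapper; the base array is not read here.
        return LazyArray(self._base, self._pending + (func,))

    def __add__(self, other):
        # Assumes `other` is a scalar.
        return self.apply(lambda x: x + other)

    def __mul__(self, other):
        # Assumes `other` is a scalar.
        return self.apply(lambda x: x * other)

    def __getitem__(self, idx):
        # Only the requested part is read, then the queued functions run on it.
        sub = np.asarray(self._base[idx])
        for func in self._pending:
            sub = func(sub)
        return sub


data = np.arange(10)  # stands in for a memmap
lazy = (LazyArray(data) + 1) * 2
value = lazy[3]
chunk = lazy[2:5]
print(value, chunk)
```

Because the wrapper is not an ndarray, numpy functions will not accept it directly, which makes the limitation explicit instead of silently loading the whole array.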

Is it possible to defer operations on a numpy.memmap?
