如何延迟加载数据结构(python)

时间:2010-12-29 23:26:04

标签: python lazy-loading

我有一些构建数据结构的方法(例如,某些文件内容):

def loadfile(FILE):
    return # some data structure created from the contents of FILE

所以我可以做像

这样的事情
puppies = loadfile("puppies.csv") # wait for loadfile to work
kitties = loadfile("kitties.csv") # wait some more
print len(puppies)
print puppies[32]

在上面的例子中,我浪费了大量时间实际阅读kitties.csv并创建了一个我从未使用过的数据结构。每当我想做某事时,我都不想经常检查if not kitties,以避免浪费。我希望能够做到

puppies = lazyload("puppies.csv") # instant
kitties = lazyload("kitties.csv") # instant
print len(puppies)                # wait for loadfile
print puppies[32]

因此,如果我不尝试使用kitties执行任何操作,loadfile("kitties.csv")永远不会被调用。

有没有一些标准方法可以做到这一点?

在玩了一下之后,我制作了以下解决方案,它似乎工作正常并且非常简短。还有其他选择吗?我应该记住使用这种方法有缺点吗?

class lazyload:
    def __init__(self,FILE):
        self.FILE = FILE
        self.F = None
    def __getattr__(self,name):
        if not self.F: 
            print "loading %s" % self.FILE
            self.F = loadfile(self.FILE)
        return object.__getattribute__(self.F, name)

如果这样的事情奏效,那可能会更好:

class lazyload:
    def __init__(self,FILE):
        self.FILE = FILE
    def __getattr__(self,name):
        self = loadfile(self.FILE) # this never gets called again
                                   # since self is no longer a
                                   # lazyload instance
        return object.__getattribute__(self, name)

但这不起作用,因为self是本地的。实际上,每当你做任何事情时,它都会调用loadfile

6 个答案:

答案 0 :(得分:4)

Python stdlibrary中的csv模块在你开始迭代之前不会加载数据,所以它实际上是懒惰的。

编辑:如果您需要通读整个文件来构建数据结构,那么拥有代理事物的复杂Lazy加载对象就会过度。就这样做:

class Lazywrapper(object):
    def __init__(self, filename):
        self.filename = filename
        self._data = None

    def get_data(self):
        if self._data = None:
            self._build_data()
        return self._data

    def _build_data(self):
        # Now open and iterate over the file to build a datastructure, and
        # put that datastructure as self._data

通过上述课程,你可以这样做:

puppies = Lazywrapper("puppies.csv") # Instant
kitties = Lazywrapper("kitties.csv") # Instant

print len(puppies.getdata()) # Wait
print puppies.getdata()[32] # instant

另外

allkitties = kitties.get_data() # wait
print len(allkitties)
print kitties[32]

如果你有一个很多的数据,并且你真的不需要加载所有数据,那么你也可以实现类似于读取文件的类,直到它找到名为“Froufrou”的小狗“然后停止,但在那时,最好将数据一次性地保存在数据库中并从那里访问它。

答案 1 :(得分:2)

如果你真的担心if语句,你有一个有状态对象。

from collections import MutableMapping
class LazyLoad( MutableMapping ):
   def __init__( self, source ):
       self.source= source
       self.process= LoadMe( self )
       self.data= None
   def __getitem__( self, key ):
       self.process= self.process.load()
       return self.data[key]
   def __setitem__( self, key, value ):
       self.process= self.process.load()
       self.data[key]= value
   def __contains__( self, key ):
       self.process= self.process.load()
       return key in self.data

此类将工作委托给process对象,该对象可以是Load或。{ DoneLoading对象。 Load对象实际上会加载。 DoneLoading 不会加载。

请注意,没有if语句。

class LoadMe( object ):
   def __init__( self, parent ):
       self.parent= parent
   def load( self ):
       ## Actually load, setting self.parent.data
       return DoneLoading( self.parent )

class DoneLoading( object ):
   def __init__( self, parent ):
       self.parent= parent
   def load( self ):
       return self

答案 2 :(得分:1)

不会if not self.F导致另一次调用__getattr__,让你陷入无限循环?我认为你的方法是有道理的,但为了安全起见,我会把这条线放到:

if name == "F" and not self.F:

另外,你可以在类上创建一个方法loadfile,具体取决于你正在做什么。

答案 3 :(得分:1)

这是一个使用类装饰器推迟初始化直到第一次使用对象的解决方案:

def lazyload(cls):
    original_init = cls.__init__
    original_getattribute = cls.__getattribute__

    def newinit(self, *args, **kwargs):
        # Just cache the arguments for the eventual initialization.
        self._init_args = args
        self._init_kwargs = kwargs
        self.initialized = False
    newinit.__doc__ = original_init.__doc__

    def performinit(self):
        # We call object's __getattribute__ rather than super(...).__getattribute__
        # or original_getattribute so that no custom __getattribute__ implementations
        # can interfere with what we are doing.
        original_init(self,
                      *object.__getattribute__(self, "_init_args"),
                      **object.__getattribute__(self, "_init_kwargs"))
        del self._init_args
        del self._init_kwargs
        self.initialized = True

    def newgetattribute(self, name):
        if not object.__getattribute__(self, "initialized"):
            performinit(self)
        return original_getattribute(self, name)

    if hasattr(cls, "__getitem__"):
        original_getitem = cls.__getitem__
        def newgetitem(self, key):
            if not object.__getattribute__(self, "initialized"):
                performinit(self)
            return original_getitem(self, key)
        newgetitem.__doc__ = original_getitem.__doc__
        cls.__getitem__ = newgetitem

    if hasattr(cls, "__len__"):
        original_len = cls.__len__
        def newlen(self):
            if not object.__getattribute__(self, "initialized"):
                performinit(self)
            return original_len(self)
        newlen.__doc__ = original_len.__doc__
        cls.__len__ = newlen

    cls.__init__ = newinit
    cls.__getattribute__ = newgetattribute
    return cls

@lazyload
class FileLoader(dict):
    def __init__(self, filename):
        self.filename = filename
        print "Performing expensive load operation"
        self[32] = "Felix"
        self[33] = "Eeek"

kittens = FileLoader("kitties.csv")
print "kittens is instance of FileLoader: %s" % isinstance(kittens, FileLoader) # Well obviously
print len(kittens) # Wait
print kittens[32] # No wait
print kittens[33] # No wait
print kittens.filename # Still no wait
print kittens.filename

输出:

kittens is instance of FileLoader: True
Performing expensive load operation
2
Felix
Eeek
kitties.csv
kitties.csv

我尝试在初始化后实际恢复原始的魔术方法,但它没有用完。可能需要代理其他魔术方法,我没有调查每个场景。

请注意,kittens.initialized将始终返回True,因为如果尚未执行初始化,它将启动初始化。显然,可以为此属性添加一个豁免,以便在没有对该对象执行其他操作时返回False,或者可以将检查更改为等同于hasattr调用,并且可以在删除之后删除初始化属性初始化。

答案 4 :(得分:0)

这是一个让“更好”的解决方案工作的黑客,但我认为这很麻烦,只使用第一个解决方案可能更好。我们的想法是通过将变量名称作为属性传递来执行步骤self = loadfile(self.FILE)

class lazyload:
    def __init__(self,FILE,var):
        self.FILE = FILE
        self.var  = var
    def __getattr__(self,name):
        x = loadfile(self.FILE)
        globals()[self.var]=x
        return object.__getattribute__(x, name)

然后你可以做

kitties = lazyload("kitties.csv","kitties")
   ^                                 ^
    \                               /
     These two better match exactly

kitties上调用任何方法后(kitties.FILEkitties.var除外),它将与您获得的完全无法区分 kitties = loadfile("kitties.csv")。特别是,它将不再是lazyloadkitties.FILE的实例,而kitties.var将不再存在。

答案 5 :(得分:0)

如果您需要使用puppies[32],还需要定义__getitem__方法,因为__getattr__无法捕获该行为。

我为我的需求实现延迟加载,有不适应的代码:

class lazy_mask(object):
  '''Fake object, which is substituted in
  place of masked object'''

  def __init__(self, master, id):
    self.master=master
    self.id=id
    self._result=None
    self.master.add(self)

  def _res(self):
    '''Run lazy job'''
    if not self._result:
      self._result=self.master.get(self.id)
    return self._result

  def __getattribute__(self, name):
    '''proxy all queries to masked object'''
    name=name.replace('_lazy_mask', '')
    #print 'attr', name
    if name in ['_result', '_res', 'master', 'id']:#don't proxy requests for own properties
      return super(lazy_mask, self).__getattribute__(name)
    else:#but proxy requests for masked object
      return self._res().__getattribute__(name)

  def __getitem__(self, key):
    '''provide object["key"] access. Else can raise 
    TypeError: 'lazy_mask' object is unsubscriptable'''
    return self._res().__getitem__(key)

(master是我在运行get()方法时加载数据的注册表对象)

这个实现适用于isinstance()和str()以及json.dumps()吗