Question

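I have a complex Python object, about 36GB in memory, that I want to share between multiple separate Python processes. It is stored on disk as a pickle file, which I currently load separately in every process. I want to share the object so that I can run more processes in parallel within the available memory.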

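In a sense, this object is used as a read-only database. Each process issues several access requests per second, and each request touches only a small portion of the data.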

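I looked into solutions like Redis, but found that the data would have to be serialized into a simple textual form. Mapping the pickle file itself into memory should not help either, since every process would still have to unpickle it. So I thought of two other possible solutions: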

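Use shared memory, where each process can access the address at which the object is stored. The problem is that a process would only see a bulk of bytes that it cannot interpret.
Write code that holds this object and manages data retrieval through API calls. Here I wonder how such a solution would perform in terms of speed.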

Is there a simple way to implement either of these solutions? Or is there perhaps a better solution for this situation?

Many thanks!

Answer 1

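For complex objects, there is no way to share memory directly between processes. If you have simple ctypes data, you can do this with C-style shared memory, but it will not map directly onto Python objects.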
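To make the "simple types only" point concrete, here is a minimal sketch using the standard-library `multiprocessing.shared_memory` module (Python 3.8+). It assumes the data can be flattened to fixed-size float32 values; the helper names `create_block` and `read_value` are hypothetical, and both sides run in one process here purely for illustration.

```python
from multiprocessing import shared_memory
import struct

FLOAT = struct.Struct("<f")  # layout of one little-endian float32

def create_block(values):
    """Copy a list of floats into a new shared-memory block and return it."""
    shm = shared_memory.SharedMemory(create=True, size=len(values) * FLOAT.size)
    for i, v in enumerate(values):
        FLOAT.pack_into(shm.buf, i * FLOAT.size, v)
    return shm

def read_value(block_name, index):
    """Attach to an existing block by name and decode one float from it."""
    shm = shared_memory.SharedMemory(name=block_name)
    try:
        return FLOAT.unpack_from(shm.buf, index * FLOAT.size)[0]
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()

owner = create_block([0.5, 1.5, 2.5])
# A separate process would call read_value() with the same block name.
second = read_value(owner.name, 1)
owner.close()
owner.unlink()  # release the block once nobody needs it
```

The reader can decode the bytes only because both sides agree on the layout in advance. A pickled dict or an arbitrary object graph has no such fixed layout, which is why this approach does not extend to the 36GB object as it stands.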

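If you only need part of the data at any given time, rather than the whole 36GB, there is a simple solution that works well. You can use SyncManager from multiprocessing.managers. With it you set up a server that exposes a proxy class for your data (the data is not stored in the class; the proxy only provides access to it). Your clients then connect to the server using a BaseManager and call methods on the proxy class to retrieve the data.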

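Behind the scenes, the Manager classes take care of pickling the data you ask for and sending it through the open port from the server to the client. Because the data is pickled on every call, this is not efficient if you need the entire dataset. When the client only needs a small portion of the data, the method saves a lot of time, since the data only has to be loaded once, by the server.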
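A rough way to see why small requests are cheap is to compare the pickled size of one reply with the pickled size of the whole dataset. The sketch below uses a small made-up stand-in (`vectors`, 10,000 entries of 50 floats each) rather than real data, and the helper `reply_size` is hypothetical; it only measures pickle output and does not involve a Manager.

```python
import pickle

# Hypothetical stand-in for the large object: 10,000 words, 50 floats each.
vectors = {"word%d" % i: [float(i)] * 50 for i in range(10000)}

def reply_size(obj):
    """Number of bytes produced when obj is pickled for sending to a client."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

one_vector_bytes = reply_size(vectors["word42"])   # cost of a single lookup
whole_dataset_bytes = reply_size(vectors)          # cost of fetching everything

print("one vector: %d bytes" % one_vector_bytes)
print("whole dataset: %d bytes" % whole_dataset_bytes)
```

Each proxy call pays only for the value it returns, so per-request lookups stay small, while a call that returns the full dictionary pays for all of it every time.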

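Speed-wise, this solution is comparable to a database solution, but if you would rather stay with a purely Pythonic solution, it can save you a lot of complexity and database learning.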

Here is some example code, written to work with GloVe word vectors.

Server

    #!/usr/bin/python
    from multiprocessing.managers import SyncManager
    import numpy

    # Global for storing the data to be served
    gVectors = {}

    # Proxy class to be shared with different processes.
    # Don't put the big vector data in here since that will force it to
    # be piped to the other process when instantiated there; instead just
    # return the global vector data, from this process, when requested.
    class GloVeProxy(object):
        def __init__(self):
            pass

        def getNVectors(self):
            return len(gVectors)

        def getEmpty(self):
            # values() is a view in Python 3, so take its first item via an iterator
            return numpy.zeros_like(next(iter(gVectors.values())))

        def getVector(self, word, default=None):
            return gVectors.get(word, default)

    # Class to encapsulate the server functionality
    class GloVeServer(object):
        def __init__(self, port, fname):
            self.port = port
            self.load(fname)

        # Load the vectors into gVectors (global)
        @staticmethod
        def load(filename):
            with open(filename, 'r') as f:
                for line in f:
                    vals = line.rstrip().split(' ')
                    gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')

        # Run the server
        def run(self):
            class myManager(SyncManager):
                pass
            myManager.register('GloVeProxy', GloVeProxy)
            # authkey must be a byte string in Python 3
            mgr = myManager(address=('', self.port), authkey=b'GloVeProxy01')
            server = mgr.get_server()
            server.serve_forever()

    if __name__ == '__main__':
        port = 5010
        fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'
        print('Loading vector data')
        gs = GloVeServer(port, fname)
        print('Serving data. Press <ctrl>-c to stop.')
        gs.run()

Client

    from multiprocessing.managers import BaseManager
    import psutil   # 3rd party module for process info (not strictly required)

    # Grab the shared proxy class. All methods in that class will be available here
    class GloVeClient(object):
        def __init__(self, port):
            assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'
            class myManager(BaseManager):
                pass
            myManager.register('GloVeProxy')
            # authkey must match the server's and be a byte string in Python 3
            self.mgr = myManager(address=('localhost', port), authkey=b'GloVeProxy01')
            self.mgr.connect()
            self.glove = self.mgr.GloVeProxy()

        # Return the instance of the proxy class
        @staticmethod
        def getGloVe(port):
            return GloVeClient(port).glove

        # Verify the server is running
        @staticmethod
        def _checkForProcess(name):
            for proc in psutil.process_iter():
                if proc.name() == name:
                    return True
            return False

    if __name__ == '__main__':
        port = 5010
        glove = GloVeClient.getGloVe(port)
        for word in ['test', 'cat', '123456']:
            print('%s = %s' % (word, glove.getVector(word)))

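Note that the psutil library is used only to check whether the server is running; it is not required. Be sure to name the server file GloVeServer.py, or change the psutil check in the code so that it looks for the correct name.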
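If you would rather not depend on psutil, one possible alternative is to probe the server's TCP port with the standard library. This is a sketch under that assumption, not part of the code above; the function name `server_is_listening` is hypothetical, and a successful probe only shows that something is listening on the port, not that it is the GloVe server.

```python
import socket

def server_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In the client, a call such as `server_is_listening('localhost', port)` could stand in for the `_checkForProcess` assertion. Unlike the process-name check, it does not depend on what the server script is called.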

Sharing a complex Python object in memory between different processes
