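I have a large object of a type that cannot be shared between processes. It has methods to instantiate it and to work on its data.
The way I currently do it is to instantiate the object in the main parent process and then pass it to child processes when certain events occur. The problem is that whenever a child process runs, it copies the object in memory every time, which takes a while. I want to store it in memory that is available only to the child processes, so that they do not have to copy it every time they call one of the object's functions.
How can I store an object for a process's own use only?
Edit: code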
class MultiQ:
    def __init__(self):
        self.pred = instantiate_predict() #here I instantiate the big object
    def enq_essay(self,essay):
        p = Process(target=self.compute_results, args=(essay,))
        p.start()
    def compute_results(self, essay):
        predictions = self.pred.predict_fields(essay) #computation in the large object that doesn't modify the object
This copies the large object in memory every time. I am trying to avoid that.
Edit 4: short code sample that runs on the 20 newsgroups data
import sklearn.feature_extraction.text as ftext
import sklearn.linear_model as lm
import multiprocessing as mp
import logging
import os
import numpy as np
import cPickle as pickle
def get_20newsgroups_fnames():
    all_files = []
    for i, (root, dirs, files) in enumerate(os.walk("/home/roman/Desktop/20_newsgroups/")):
        if i>0:
            all_files.extend([os.path.join(root,file) for file in files])
    return all_files
documents = [unicode(open(f).read(), errors="ignore") for f in get_20newsgroups_fnames()]
logger = mp.get_logger()
formatter = logging.Formatter('%(asctime)s: [%(processName)12s] %(message)s',
datefmt = '%H:%M:%S')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
mp._log_to_stderr = True
def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total
def predict(large_object, essay="this essay will be predicted"):
    """this method copies the large object in memory which is what im trying to avoid"""
    vectorized_essay = large_object[0].transform(essay)
    large_object[1].predict(vectorized_essay)
    report_memory("done")
def train_and_model():
    """this is very similar to the instantiate_predict method from my first code sample"""
    tfidf_vect = ftext.TfidfVectorizer()
    X = tfidf_vect.fit_transform(documents)
    y = np.random.random_integers(0,1,19997)
    model = lm.LogisticRegression()
    model.fit(X, y)
    return (tfidf_vect, model)
def report_memory(label):
    f = free_memory()
    logger.warn('{l:<25}: {f}'.format(f=f, l=label))
def dump_large_object(large_object):
    f = open("large_object.obj", "w")
    pickle.dump(large_object, f, protocol=2)
    f.close()
def load_large_object():
    f = open("large_object.obj")
    large_object = pickle.load(f)
    f.close()
    return large_object
if __name__ == '__main__':
    report_memory('Initial')
    tfidf_vect, model = train_and_model()
    report_memory('After train_and_model')
    large_object = (tfidf_vect, model)
    procs = [mp.Process(target=predict, args=(large_object,))
             for i in range(mp.cpu_count())]
    report_memory('After Process')
    for p in procs:
        p.start()
    report_memory('After p.start')
    for p in procs:
        p.join()
    report_memory('After p.join')
Output 1:
19:01:39: [ MainProcess] Initial : 26585728
19:01:51: [ MainProcess] After train_and_model : 25958924
19:01:51: [ MainProcess] After Process : 25958924
19:01:51: [ MainProcess] After p.start : 25925908
19:01:51: [ Process-1] done : 25725524
19:01:51: [ Process-2] done : 25781076
19:01:51: [ Process-4] done : 25789880
19:01:51: [ Process-3] done : 25802032
19:01:51: [ MainProcess] After p.join : 25958272
roman@ubx64:$ du -h large_object.obj
4.6M large_object.obj
So maybe the large object is not even large, and my problem lies in the memory used by the tfidf vectorizer's transform method.
Now if I change the main method to:
report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
procs = [mp.Process(target=predict, args=(large_object,))
         for i in range(mp.cpu_count())]
report_memory('After Process')
for p in procs:
    p.start()
report_memory('After p.start')
for p in procs:
    p.join()
report_memory('After p.join')
I get these results. Output 2:
20:07:23: [ MainProcess] Initial : 26578356
20:07:23: [ MainProcess] After loading the object : 26544380
20:07:23: [ MainProcess] After Process : 26544380
20:07:23: [ MainProcess] After p.start : 26523268
20:07:24: [ Process-1] done : 26338012
20:07:24: [ Process-4] done : 26337268
20:07:24: [ Process-3] done : 26439444
20:07:24: [ Process-2] done : 26438948
20:07:24: [ MainProcess] After p.join : 26542860
Then I changed the main method to:
report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
predict(large_object)
report_memory('After Process')
and got these results. Output 3:
20:13:34: [ MainProcess] Initial : 26572580
20:13:35: [ MainProcess] After loading the object : 26538356
20:13:35: [ MainProcess] done : 26513804
20:13:35: [ MainProcess] After Process : 26513804
At this point I have no idea what is going on, but multiprocessing definitely uses more memory.
Answer 0 (score: 2)
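Linux uses copy-on-write, which means that when a subprocess is forked, the global variables in each subprocess share the same memory addresses until a value is modified. A value is copied only when it is modified.
So in theory, if the large object is not modified, the subprocesses can use it without consuming more memory. Let's test that theory.
Here is your code, modified to log some memory usage: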
import sklearn.feature_extraction.text as ftext
import sklearn.linear_model as lm
import multiprocessing as mp
import logging
logger = mp.get_logger()
formatter = logging.Formatter('%(asctime)s: [%(processName)12s] %(message)s',
datefmt='%H:%M:%S')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
mp._log_to_stderr = True
def predict(essay="this essay will be predicted"):
    """this method copies the large object in memory which is what im trying to avoid"""
    vectorized_essay = large_object[0].transform(essay)
    large_object[1].predict(vectorized_essay)
    report_memory("done")
def train_and_model():
    """this is very similar to the instantiate_predict method from my first code sample"""
    tfidf_vect = ftext.TfidfVectorizer()
    N = 100000
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?', ] * N
    y = [1, 0, 1, 0] * N
    report_memory('Before fit_transform')
    X = tfidf_vect.fit_transform(corpus)
    model = lm.LogisticRegression()
    model.fit(X, y)
    report_memory('After model.fit')
    return (tfidf_vect, model)
def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total
def gen_change_in_memory():
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2
change_in_memory = gen_change_in_memory().next
def report_memory(label):
    logger.warn('{l:<25}: {d:+d}'.format(d=change_in_memory(), l=label))
if __name__ == '__main__':
    report_memory('Initial')
    tfidf_vect, model = train_and_model()
    report_memory('After train_and_model')
    large_object = (tfidf_vect, model)
    procs = [mp.Process(target=predict) for i in range(mp.cpu_count())]
    report_memory('After Process')
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    report_memory('After p.join')
It yields:
21:45:01: [ MainProcess] Initial : +0
21:45:01: [ MainProcess] Before fit_transform : +3224
21:45:12: [ MainProcess] After model.fit : +153572
21:45:12: [ MainProcess] After train_and_model : -3100
21:45:12: [ MainProcess] After Process : +0
21:45:12: [ Process-1] done : +2232
21:45:12: [ Process-2] done : +2976
21:45:12: [ Process-3] done : +3596
21:45:12: [ Process-4] done : +3224
21:45:12: [ MainProcess] After p.join : -372
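The numbers reported are the change, in KiB, of free memory (including cached memory and buffers). So, for example, the change in free memory between 'Initial' and 'After train_and_model' was about 150MB. Thus large_object requires about 150MB.
Then, after the 4 subprocesses have completed, a much smaller amount of memory, about 12MB in total, has been consumed. That memory may be due to the creation of the subprocesses plus the memory used by the transform and predict methods.
So it appears that large_object was not copied, since if it had been we should have seen memory consumption increase by about 150MB.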
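The pattern being tested above can be reduced to a few lines. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in the thread, and it assumes a Unix system where the fork start method is available. `build_large_object` and its dictionary are hypothetical stand-ins for `train_and_model()` and the real `(tfidf_vect, model)` pair. The key point is that the object is a module-level global created before the children start, and that it is not passed through `args`.

```python
import multiprocessing as mp

# Hypothetical stand-in for the real (tfidf_vect, model) pair.
large_object = None


def build_large_object():
    """Stand-in for train_and_model(): build something sizeable once."""
    return {"weights": list(range(1000000))}


def predict(index, results):
    """Read the inherited global; the large object is not an argument."""
    results.put((index, large_object["weights"][index]))


def run(n_children=2):
    global large_object
    # The object must exist before the children are forked.
    large_object = build_large_object()
    # Assumption: fork is available (Linux/Unix). Forked children start
    # with the parent's memory pages, shared copy-on-write.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    procs = [ctx.Process(target=predict, args=(i, results))
             for i in range(n_children)]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    out = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    print(run())
```

One caveat: in CPython, even reading an object updates its reference count, which writes to the page holding it, so some pages may still be copied in the children. Only measurements like the ones above can tell how much that matters for a given object.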
Commentary on your runs on the 20 newsgroups data:
Here are the changes in free memory:
On the 20 newsgroups data:
| Initial | 0 |
| After train_and_model | 626804 | <-- Large object requires 627M
| After Process | 0 |
| After p.start | 33016 |
| done | 200384 |
| done | -55552 |
| done | -8804 |
| done | -12152 |
| After p.join | -156240 |
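So it looks like instantiating the large object takes 627MB.
I have no idea why an additional 200+MB has been consumed by the time the first done is reached.
Using load_large_object: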
| Initial | 0 |
| After loading the object | 33976 |
| After Process | 0 |
| After p.start | 21112 |
| done | 185256 |
| done | 744 |
| done | -102176 |
| done | 496 |
| After p.join | -103912 |
Apparently, large_object itself requires only 34MB; the rest of the memory, 627 - 34 = 593MB, must have been used by the fit_transform and fit methods called in train_and_model.
Using a single process:
| Initial | 0 |
| After loading the object | 34224 |
| done | 24552 |
| After Process | 0 |
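This seems plausible.
So the data you have accumulated seems to support the claim that the large object itself is not being copied by each subprocess. But a new mystery has arisen: why is there such a huge consumption of memory between "After p.start" and the first "done"? I don't know the answer.
You could try putting report_memory calls around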
vectorized_essay = large_object[0].transform(essay)
and
large_object[1].predict(vectorized_essay)
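to see where the extra memory is being consumed. My guess is that one of these scikit-learn methods chooses to allocate this (relatively) large amount of memory.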
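One way to bracket the two calls is a small context manager. This is a sketch only: it swaps the `/proc/meminfo` probe used above for the standard library's `tracemalloc`, which counts only allocations made through Python's memory allocator in the current process, so its numbers are not directly comparable to the free-memory figures in this answer. `memory_step`, `fake_transform` and `fake_predict` are hypothetical names; the two fake functions stand in for `large_object[0].transform(essay)` and `large_object[1].predict(vectorized_essay)`.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def memory_step(label, log):
    """Append (label, bytes still allocated by the wrapped step) to log."""
    before, _ = tracemalloc.get_traced_memory()
    yield
    after, _ = tracemalloc.get_traced_memory()
    log.append((label, after - before))


def fake_transform(essay):
    """Hypothetical stand-in for large_object[0].transform(essay)."""
    return [float(len(word)) for word in essay.split()] * 50000


def fake_predict(vectorized):
    """Hypothetical stand-in for large_object[1].predict(vectorized_essay)."""
    return sum(vectorized) > 0


def predict(essay="this essay will be predicted"):
    log = []
    tracemalloc.start()
    with memory_step("transform", log):
        vectorized = fake_transform(essay)
    with memory_step("predict", log):
        label = fake_predict(vectorized)
    tracemalloc.stop()
    return label, log


if __name__ == "__main__":
    for name, delta in predict()[1]:
        print("{0:<10}: {1:+d} bytes".format(name, delta))
```

With the real vectorizer and model in place of the stand-ins, the per-step deltas would show whether transform or predict accounts for the jump seen before the first "done".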
Answer 1 (score: 0)
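I ended up using an RPC server built on Rabbit MQ (see the Rabbit MQ Tutorial for RPC/Python). I created as many servers as there are CPUs on my machine. These servers start once, allocate memory for the model and the vectorizer once, and keep it for as long as they run. Another advantage is that, overall, my code became much cleaner too.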
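The same "load once per long-lived worker" idea can be sketched without a message broker. To be clear, this is not the Rabbit MQ setup described above but a standard-library substitute: a `multiprocessing.Pool` whose `initializer` runs once in each worker. `load_model` and its dictionary are hypothetical stand-ins for loading the real vectorizer and model, and the fork start method is assumed (Linux/Unix, Python 3).

```python
import multiprocessing as mp

_model = None  # per-worker global, filled in once by init_worker


def load_model():
    """Hypothetical stand-in for loading the vectorizer and model."""
    return {"bias": 1}


def init_worker():
    """Runs once in each worker process when the pool starts."""
    global _model
    _model = load_model()


def predict(essay):
    """Use the model that is already resident in this worker."""
    return len(essay.split()) + _model["bias"]


def serve(essays, workers=2):
    # Assumption: fork is available, as on the Linux box in the question.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=workers, initializer=init_worker) as pool:
        return pool.map(predict, essays)


if __name__ == "__main__":
    print(serve(["one two", "three"]))
```

Each worker pays the loading cost once and then handles any number of essays, which is the property the RPC servers provide. A broker-based setup additionally lets the workers live in separate programs or on separate machines.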