I'm working on a project that requires me to extract a lot of information from some files. The project's format and most of the details are irrelevant to what I'm asking. What I barely understand is how to share this dictionary with all the processes in a process pool.
Here is my code (variable names changed and most of the code removed; only part of it is needed to understand the question):
import json
import multiprocessing
from multiprocessing import Pool, Lock, Manager
import glob
import os

def record(thing, map):
    with mutex:
        if thing in map:
            map[thing] += 1
        else:
            map[thing] = 1
def getThing(file, n, map):
    # do stuff
    thing = file.read()
    record(thing, map)

def init(l):
    global mutex
    mutex = l
def main():
    # create a manager to manage shared dictionaries
    manager = Manager()
    # get the list of filenames to be analyzed
    fileSet1 = glob.glob("filesSet1/*")
    fileSet2 = glob.glob("fileSet2/*")
    # create a global mutex for the processes to share
    l = Lock()
    map = manager.dict()
    # create a process pool, give it the global mutex, and max cpu count-1 (manager is its own process)
    with Pool(processes=multiprocessing.cpu_count()-1, initializer=init, initargs=(l,)) as pool:
        pool.map(lambda file: getThing(file, 2, map), fileSet1)  # This line is what I need help with

main()
As I understand it, the lambda function should work. The line I need help with is: pool.map(lambda file: getThing(file, 2, map), fileSet1). It gives me an error: "AttributeError: Can't pickle local object 'main.&lt;locals&gt;.&lt;lambda&gt;'".
Any help would be greatly appreciated!
Answer (score: 0)
In order to execute tasks in parallel, multiprocessing "pickles" the task function. In your case, this "task function" is lambda file: getThing(file, 2, map).
Unfortunately, lambda functions cannot be pickled in Python by default (see also this stackoverflow post). Let me illustrate the problem with a minimal piece of code:
import multiprocessing

l = range(12)

def not_a_lambda(e):
    print(e)

def main():
    with multiprocessing.Pool() as pool:
        pool.map(not_a_lambda, l)        # Case (A)
        pool.map(lambda e: print(e), l)  # Case (B)

main()
In case (A) we have a proper, free function that can be pickled, so the pool.map operation works. In case (B) we have a lambda function, and a crash occurs.
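If the lambda was only there to bind extra arguments such as n and map to a module-level function, the standard library's functools.partial is enough: a partial object pickles as long as the wrapped function and the bound arguments do. A minimal sketch (the names get_thing and shared_map are illustrative, not from the question; note that partial binds leading arguments, so the per-item argument must come last):

```python
import multiprocessing
from functools import partial

# A module-level function is picklable; partial binds the first two
# arguments, leaving `file` to be supplied by pool.map's iterable.
def get_thing(n, shared_map, file):
    # hypothetical stand-in for the question's getThing(file, n, map)
    shared_map[file] = n

def main():
    manager = multiprocessing.Manager()
    shared_map = manager.dict()  # proxy objects are picklable
    files = ["a.txt", "b.txt", "c.txt"]
    with multiprocessing.Pool(2) as pool:
        pool.map(partial(get_thing, 2, shared_map), files)
    print(dict(shared_map))

if __name__ == "__main__":
    main()
```

This keeps the question's original structure intact; only the argument order of the task function changes so that the bound arguments come first.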
One possible solution is to use a proper module-scoped function (like my not_a_lambda). Another solution is to rely on a third-party module such as dill to extend the pickling capabilities; in that case you would use, e.g., pathos in place of the regular multiprocessing module. Finally, you can create a Worker class that collects your shared state as members. It could look like this:
import multiprocessing

l = range(12)

class Worker:
    def __init__(self, mutex, map):
        self.mutex = mutex
        self.map = map

    def __call__(self, e):
        print("Hello from Worker e=%r" % (e, ))
        with self.mutex:
            k, v = e
            self.map[k] = v
        print("Goodbye from Worker e=%r" % (e, ))
def main():
    manager = multiprocessing.Manager()
    mutex = manager.Lock()
    map = manager.dict()
    # there is only ONE Worker instance which is shared across all processes;
    # thus, you need to make sure you don't access / modify internal state of
    # the worker instance without locking the mutex.
    worker = Worker(mutex, map)
    with multiprocessing.Pool() as pool:
        # enumerate(l) yields picklable (index, value) pairs for __call__
        # to unpack (a plain range object has no .items() method)
        pool.map(worker, enumerate(l))

main()
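A fourth option worth sketching (an assumption on my part, not something spelled out above): the question's code already ships the Lock to the workers through the pool's initializer, and a manager dict can travel the same way, which removes the need for a lambda entirely. The names init, get_thing, and result_map here are hypothetical:

```python
import multiprocessing

def init(lock, shared_map):
    # runs once in each worker process; stashes the shared objects
    # in module-level globals, just like the question's init(l)
    global mutex, result_map
    mutex = lock
    result_map = shared_map

def get_thing(file):
    # stand-in for the question's getThing; takes only the per-item
    # argument, so a plain module-level function suffices
    with mutex:
        result_map[file] = result_map.get(file, 0) + 1

def main():
    manager = multiprocessing.Manager()
    lock = multiprocessing.Lock()
    shared_map = manager.dict()
    files = ["a", "b", "a"]
    with multiprocessing.Pool(2, initializer=init,
                              initargs=(lock, shared_map)) as pool:
        pool.map(get_thing, files)
    print(dict(shared_map))

if __name__ == "__main__":
    main()
```

Passing the lock via initargs matters: multiprocessing locks cannot be sent as pool.map arguments, but they may be handed to workers through initializer/initargs.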