我正在学习python多处理,我正在尝试使用此功能来填充包含操作系统中所有文件的列表。但是,我编写的代码只是按顺序执行。
#!/usr/bin/python
import os
import multiprocessing
tld = [os.path.join("/", f) for f in os.walk("/").next()[1]] #Gets a top level directory names inside "/"
manager = multiprocessing.Manager()
files = manager.list()
def get_files(x):
for root, dir, file in os.walk(x):
for name in file:
files.append(os.path.join(root, name))
mp = [multiprocessing.Process(target=get_files, args=(tld[x],))
for x in range(len(tld))]
for i in mp:
i.start()
i.join()
print len(files)
当我检查进程树时,我只能看到产生的一个智能进程。 (man pstree说{}表示父母产生的子进程。)
---bash(10949)---python(12729)-+-python(12730)---{python}(12752)
`-python(12750)`
我正在寻找的是,为每个tld目录生成一个进程,填充共享列表files
,这将是大约10-15个进程,具体取决于目录的数量。我做错了什么?
编辑::
我用multiprocessing.Pool
来创建工作线程,这次是
生成的进程,但在我尝试使用multiprocessing.Pool.map()
时出错。我指的是显示
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
在该示例之后,我将代码重写为
import os
import multiprocessing
tld = [os.path.join("/", f) for f in os.walk("/").next()[1]]
manager = multiprocessing.Manager()
pool = multiprocessing.Pool(processes=len(tld))
print pool
files = manager.list()
def get_files(x):
for root, dir, file in os.walk(x):
for name in file:
files.append(os.path.join(root, name))
pool.map(get_files, [x for x in tld])
pool.close()
pool.join()
print len(files)
它正在分支多个流程。
---bash(10949)---python(12890)-+-python(12967)
|-python(12968)
|-python(12970)
|-python(12971)
|-python(12972)
---snip---
但代码错误说
Process PoolWorker-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
return recv()
AttributeError: 'module' object has no attribute 'get_files'
self._target(*self._args, **self._kwargs)
self.run()
task = get()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
AttributeError: 'module' object has no attribute 'get_files'
self.run()
我在这里做错了什么,以及为什么get_files()函数出错?
答案 0 :(得分:4)
这只是因为您在定义函数get_files
之前实例化了池:
import os
import multiprocessing
tld = [os.path.join("/", f) for f in os.walk("/").next()[1]]
manager = multiprocessing.Manager()
files = manager.list()
def get_files(x):
for root, dir, file in os.walk(x):
for name in file:
files.append(os.path.join(root, name))
pool = multiprocessing.Pool(processes=len(tld)) # Instantiate the pool here
pool.map(get_files, [x for x in tld])
pool.close()
pool.join()
print len(files)
流程的总体思路是,在您启动它的瞬间,您将分叉主流程的内存。所以在主进程 中完成的任何定义 后,fork将不在子进程中。
如果你想要一个共享内存,可以使用threading
库,但是你会遇到一些问题(参见:The global interpreter lock)