I wrote a multiprocessing counter and compared it with the native collections.Counter.
Why is my multiprocessing counter slower than collections.Counter?
[multi-count.py]:
import io
from collections import Counter
from multiprocessing import Process, Manager, Lock
import random
import time

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()  # dict proxy served by a manager process
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter,_in)) for _in in inputs]  # one process per input item
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [random.randint(1,10) for _ in range(1000)]

start = time.time()
print (multiproc_count(inputs))
print (time.time() - start)

start = time.time()
print (Counter(inputs))
print (time.time() - start)
[OUT]:
{1: 88, 2: 95, 3: 99, 4: 98, 5: 102, 6: 111, 7: 99, 8: 103, 9: 97, 10: 108}
4.128664016723633
Counter({6: 111, 10: 108, 8: 103, 5: 102, 3: 99, 7: 99, 4: 98, 9: 97, 2: 95, 1: 88})
0.0006728172302246094
I ran it with Python 3:
$ ulimit -n 2048
$ python3 multi-count.py
To make the task harder, I increased the input to 10000 items and got an OSError:
File "multi-count.py", line 29, in <module>
print (multiproc_count(inputs))
File "multi-count.py", line 23, in multiproc_count
Process Process-2043:
for p in procs: p.start()
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
Traceback (most recent call last):
self._popen = self._Popen(self)
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 709, in _callmethod
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 93, in run
File "bpe-multi.py", line 18, in func
File "bpe-multi.py", line 15, in increment
File "<string>", line 2, in get
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 713, in _callmethod
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 700, in _connect
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 487, in Client
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 612, in SocketClient
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 134, in __init__
OSError: [Errno 24] Too many open files
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
I can't increase the ulimit on my laptop:
$ ulimit -n 4096
-bash: ulimit: open files: cannot modify limit: Operation not permitted
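One possible explanation for the refusal, which is an assumption and not something the output confirms: in bash, `ulimit -n N` with neither `-S` nor `-H` sets both the soft and the hard limit, so the earlier `ulimit -n 2048` may have lowered the hard limit for that shell, and an unprivileged process cannot raise a hard limit again. A minimal sketch for inspecting both limits from Python, using the Unix-only `resource` module (the helper names are hypothetical):

```python
import resource


def nofile_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_raise_soft_limit_to(target):
    """True if target does not exceed the hard limit, which is the ceiling
    an unprivileged process can raise its soft limit to."""
    _, hard = nofile_limits()
    return hard == resource.RLIM_INFINITY or target <= hard


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft)
    print("hard limit:", hard)
    print("4096 within hard limit:", can_raise_soft_limit_to(4096))
```

If the hard limit printed here is already 2048, opening a fresh shell before running `ulimit -n` may behave differently.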
Using multiprocessing.Pool:
import io
from collections import Counter
from multiprocessing import Process, Manager, Lock, Pool
import random
import time

def func(counter, x):
    counter[x] = counter.get(x, 0) + 1  # read and write are two separate proxy calls, no lock

inputs = [random.randint(1,10) for _ in range(10000)]

manager = Manager()
counter = manager.dict()

pool = Pool(4)
for x in inputs:
    pool.apply_async(func, [counter, x])  # one task per input item
pool.close()
pool.join()

print(counter)
[OUT]:
$ time python multi-count.py
{1: 978, 2: 978, 3: 997, 4: 982, 5: 958, 6: 1033, 7: 1044, 8: 1008, 9: 1007, 10: 1004}
real 0m16.187s
user 0m18.817s
sys 0m14.055s
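A quick sanity check on the two multiprocessing outputs in this post: summing the printed counts shows whether every increment was recorded. The dict literals below are copied from the outputs above; `missing_increments` is a hypothetical helper name.

```python
# Counts copied verbatim from the two outputs shown in this post.
process_output = {1: 88, 2: 95, 3: 99, 4: 98, 5: 102,
                  6: 111, 7: 99, 8: 103, 9: 97, 10: 108}
pool_output = {1: 978, 2: 978, 3: 997, 4: 982, 5: 958,
               6: 1033, 7: 1044, 8: 1008, 9: 1007, 10: 1004}


def missing_increments(counts, n_inputs):
    """Number of inputs that are not reflected in the recorded counts."""
    return n_inputs - sum(counts.values())


print(missing_increments(process_output, 1000))   # 0
print(missing_increments(pool_output, 10000))     # 11
```

The Lock-protected version accounts for all 1000 inputs, while the Pool version comes up 11 short. A likely cause, stated as an assumption, is that the Pool version's func reads with get and writes back without a lock, so two workers can read the same value and one update overwrites the other.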
Using the native collections.Counter:
$ time python3 -c 'import random; from collections import Counter; inputs = [random.randint(1,10) for _ in range(10000)]; print (Counter(inputs))'
Counter({6: 1067, 4: 1048, 3: 1021, 5: 1010, 9: 992, 7: 985, 8: 983, 1: 969, 2: 964, 10: 961})
real 0m0.099s
user 0m0.059s
sys 0m0.018s
$ time python3 -c 'import random; from collections import Counter; inputs = [random.randint(1,10) for _ in range(100000)]; print (Counter(inputs))'
Counter({9: 10159, 10: 10114, 8: 10046, 3: 10028, 7: 9998, 6: 9994, 2: 9982, 4: 9951, 1: 9898, 5: 9830})
real 0m0.236s
user 0m0.206s
sys 0m0.016s
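For comparison, here is a minimal sketch of a different decomposition, not the code timed above: instead of one process or task per item updating a shared Manager dict, each worker counts a whole chunk with a local Counter and the parent merges the partial results. The function names and the chunking scheme are hypothetical, and it has not been timed here, so whether it beats a plain Counter on inputs this small is an open question.

```python
from collections import Counter
from multiprocessing import Pool
import random


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_chunk(chunk):
    """Runs in a worker: count one chunk locally, with no shared state."""
    return Counter(chunk)


def merge_counts(partials):
    """Add the per-chunk Counters together in the parent process."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_count(items, n_workers=4):
    """Count items with one task per chunk instead of one task per item."""
    chunks = split_into_chunks(items, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(count_chunk, chunks)
    return merge_counts(partials)


if __name__ == "__main__":
    inputs = [random.randint(1, 10) for _ in range(10000)]
    result = parallel_count(inputs)
    assert result == Counter(inputs)  # same totals as the native Counter
    print(result)
```

With this layout only a handful of processes are started and only one Counter per chunk crosses a process boundary, so the file descriptor limit is not approached and no lock is needed.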