Take this code as an example:
import hashlib
import os

def get_hash(path, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # Note: os.O_BINARY only exists on Windows; on POSIX it is unnecessary.
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    # Read the file in blocks until os.read() returns an empty bytes object.
    for block in iter(lambda: os.read(f, 1024 * func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
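For reference, a minimal usage sketch (the file name here is a made-up placeholder):

    print(get_hash('example.bin'))            # md5 by default
    print(get_hash('example.bin', 'sha256'))  # any algorithm hashlib exposes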
get_hash returns the md5sum of any file. Suppose I have a directory containing 30+ files, and I want to run the hash function on each of them:
def hasher(path=some_path):
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            yield get_hash(path)
@some_timer_decorator
... some testing function here ...
test1 took 4.684999942779541 seconds.
Now, as you can see, the situation at hand gives me the opportunity to "exploit" the hasher function and add multiprocessing:
import multiprocessing

def hasher_parallel(path=PATH):
    p = multiprocessing.Pool(3)
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            full_name = os.path.join(root, name)
            yield p.apply_async(get_hash, (full_name,)).get()
@some_timer_decorator
... some other testing function here ...
test2 took 4.781000137329102 seconds.
The output is identical. I expected the parallel version to be faster, since most of the files are smaller than 20MB and the hasher function computes those sums very quickly (in general, for files of that size). Is there something wrong with my implementation? If there is nothing wrong with it, is there a faster way to solve the same problem?
This is the decorator function I use to measure execution time:
import time

def hasher_time(f):
    def f_timer(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print(f.__name__, 'took', end - start, 'seconds')
        return result
    return f_timer
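For completeness, a hypothetical test function wired up with this decorator (hasher and some_path come from the snippets above; test1 is a made-up name matching the timing output) could look like:

    @hasher_time
    def test1():
        # Consume the generator so every hash is actually computed.
        return list(hasher())

    hashes = test1()  # prints: test1 took ... seconds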
Answer 0 (score: 3)
You are submitting a job and then immediately waiting for it to complete:

    yield p.apply_async(get_hash, (full_name,)).get()

The AsyncResult.get() method blocks until the job is done, so you are effectively running the jobs sequentially. Collect the jobs instead, poll them with AsyncResult.ready() until they are complete, and only then .get() the results.
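A minimal sketch of that polling approach, reusing get_hash, PATH, and the pool size from the question (the function name and sleep interval are illustrative):

    def hasher_parallel_polled(path=PATH):
        p = multiprocessing.Pool(3)
        jobs = [p.apply_async(get_hash, (os.path.join(root, name),))
                for root, dirs, files in os.walk(path, topdown=False)
                for name in files]
        while not all(job.ready() for job in jobs):
            time.sleep(0.01)  # poll until every job has finished
        for job in jobs:
            yield job.get()   # no longer blocks; all jobs are done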
Better still, push all the jobs into the pool with .apply_async() calls, close the pool, call .join() (which blocks until all jobs have completed), and then retrieve the results with .get():
def hasher_parallel(path=PATH):
    p = multiprocessing.Pool(3)
    jobs = []
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            full_name = os.path.join(root, name)
            jobs.append(p.apply_async(get_hash, (full_name,)))
    p.close()
    p.join()  # wait for all jobs to complete
    for job in jobs:
        yield job.get()
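Note that close() must be called before join(); join() raises an error while the pool is still accepting work. And because every job has finished by the time the loop runs, the job.get() calls return immediately, with the results coming back in submission order.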
You can simplify the code with the Pool.imap() method; it yields the results as they become available:
def hasher_parallel(path=PATH):
    p = multiprocessing.Pool(3)
    filenames = (os.path.join(root, name)
                 for root, dirs, files in os.walk(path, topdown=False)
                 for name in files)
    for result in p.imap(get_hash, filenames):
        yield result
But do experiment with the chunksize parameter and with the unordered variant, Pool.imap_unordered(), as well.
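For instance, a sketch of the unordered variant (the function name and chunksize value are illustrative, not tuned):

    def hasher_parallel_unordered(path=PATH, chunksize=8):
        p = multiprocessing.Pool(3)
        filenames = (os.path.join(root, name)
                     for root, dirs, files in os.walk(path, topdown=False)
                     for name in files)
        # Results arrive as workers finish, not in submission order;
        # larger chunks reduce inter-process communication overhead.
        for result in p.imap_unordered(get_hash, filenames, chunksize):
            yield result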