I am trying to parallelize taking a subset of a Python dictionary. The code below creates a new dictionary, positions_sub, based on whether keys in the positions dictionary are found in the list node_list:

positions_sub = {}
for k, v in positions.items():
    if k in node_list:
        positions_sub[k] = v
This code works fine and does exactly what I want. However, it takes a while to run, so I am trying to parallelize it. I attempted that in the code below, but it returns positions_sub as a list of dictionaries, which is not what I want. There are also some issues with the number of values per key. Any ideas how to get this working? Thanks!

from joblib import Parallel, delayed

def dict_filter(k, v):
    if k in node_list:
        positions_sub[k] = v
    return positions_sub

positions_sub = Parallel(n_jobs=-1)(delayed(dict_filter)(k, v) for k, v in positions.items())
Answer 0 (score: 1)
Before resorting to parallelization, you should make sure you are using the right data structure for each task: remember that x in list is essentially O(n), while x in set (and also x in dict) is more like O(1). Therefore, simply converting your node_list to a set can greatly improve performance:

node_list = set(node_list)

positions_sub = {}
for k, v in positions.items():
    if k in node_list:
        positions_sub[k] = v

Another thing to consider is the ratio between len(positions) and len(node_list). If one is much smaller than the other, you should always iterate over the smaller one.

EDIT: some code for a performance comparison:
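As an illustration of the second point, here is a minimal sketch (the toy data is mine, not the OP's): when node_list is much smaller than positions, a comprehension over node_list only touches the small collection and does O(1) dict lookups into positions.

```python
# Toy data for illustration only.
positions = {i: i * i for i in range(10_000)}
node_list = {10, 20, 30}  # already converted to a set

# Iterate over the smaller collection: only len(node_list) lookups,
# regardless of how large positions is.
positions_sub = {k: positions[k] for k in node_list if k in positions}
assert positions_sub == {10: 100, 20: 400, 30: 900}
```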
import random
import timeit
import functools

def generate(n_positions=1000, n_node_list=100):
    positions = {i: i for i in random.sample(range(n_positions), n_positions)}
    node_list = random.sample(range(max(n_positions, n_node_list)), n_node_list)
    return positions, node_list

def validate(variant):
    data = generate(1000, 100)
    if sorted(data[1]) != sorted(k for k in variant(*data)):
        raise Exception(f"{variant.__name__} failed")

def measure(variant, data, repeats=1000):
    total_seconds = timeit.Timer(functools.partial(variant, *data)).timeit(repeats)
    average_ms = total_seconds / repeats * 1000
    print(f"{variant.__name__:10s} took an average of {average_ms:0.2f}ms per pass over {repeats} passes")

def variant1(positions, node_list):
    positions_sub = {}
    for k, v in positions.items():
        if k in node_list:
            positions_sub[k] = v
    return positions_sub

def variant1b(positions, node_list):
    node_list = set(node_list)
    positions_sub = {}
    for k, v in positions.items():
        if k in node_list:
            positions_sub[k] = v
    return positions_sub

def variant2(positions, node_list):
    return {k: v for k, v in positions.items() if k in node_list}

def variant2b(positions, node_list):
    node_list = set(node_list)
    return {k: v for k, v in positions.items() if k in node_list}

def variant3(positions, node_list):
    return {k: positions[k] for k in node_list if k in positions}

if __name__ == "__main__":
    variants = [variant1, variant1b, variant2, variant2b, variant3]
    for variant in variants:
        validate(variant)
    n_positions = 4000
    n_node_list = 1000
    n_repeats = 100
    data = generate(n_positions, n_node_list)
    print(f"data generated with len(positions)={n_positions} and len(node_list)={n_node_list}")
    for variant in variants:
        measure(variant, data, n_repeats)

EDIT2: As requested, here are some results on my machine:
first run:

data generated with len(positions)=4000 and len(node_list)=1000
variant1   took an average of 6.90ms per pass over 100 passes
variant1b  took an average of 0.22ms per pass over 100 passes
variant2   took an average of 6.95ms per pass over 100 passes
variant2b  took an average of 0.12ms per pass over 100 passes
variant3   took an average of 0.19ms per pass over 100 passes

second run:

data generated with len(positions)=40000 and len(node_list)=10000
variant1   took an average of 738.23ms per pass over 10 passes
variant1b  took an average of 2.04ms per pass over 10 passes
variant2   took an average of 739.51ms per pass over 10 passes
variant2b  took an average of 1.52ms per pass over 10 passes
variant3   took an average of 1.85ms per pass over 10 passes
Note that n=len(positions) and m=len(node_list) were chosen such that the ratio n/m=4 roughly matches that of the OP's data, where len(node_list) is around 300K.
Notice the effect of scaling up by a factor of 10 from the first run to the second: in the first run, variant1b is about 31 times faster than variant1, while in the second run it is 361 times faster! This is the expected result of reducing the complexity of the k in node_list lookup from O(m) to O(1). variant1 has a total time complexity of n * m = 0.25 * n^2 = O(n^2), while variant1b has only n * 1 = O(n). This means that for every order of magnitude n grows, variant1b also becomes an order of magnitude faster than variant1.
It is quite doubtful that a comparable performance improvement could be achieved by parallelization alone, since the expected gain for an embarrassingly parallel problem is a multiple of the available CPUs, which is still a constant factor and nowhere near the improvement of taking the algorithm from O(n^2) to O(n).
Also, while IMHO the given problem falls into the class of embarrassingly parallel problems, the output must be aggregated after the parallel processing before it can be used. Furthermore, I am not familiar with joblib, which is why I skipped adding it to the comparison.
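For completeness, the aggregation step mentioned above can be sketched with the standard library instead of joblib. This is only a sketch under my own assumptions: the chunking scheme, the helper names (filter_chunk, parallel_filter), and the use of concurrent.futures are my choices, not something from the question. Each worker filters its own slice of the items, and the partial dictionaries are merged back into one result at the end:

```python
from concurrent.futures import ThreadPoolExecutor

def filter_chunk(items, node_set):
    # Each worker filters its own chunk and returns a partial dict.
    return {k: v for k, v in items if k in node_set}

def parallel_filter(positions, node_list, n_workers=4):
    node_set = set(node_list)  # O(1) membership, as discussed above
    items = list(positions.items())
    size = max(1, len(items) // n_workers)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Aggregate the partial dicts into a single dictionary.
        for partial in pool.map(lambda c: filter_chunk(c, node_set), chunks):
            merged.update(partial)
    return merged

positions = {i: i for i in range(1000)}
node_list = list(range(0, 1000, 10))
assert parallel_filter(positions, node_list) == {k: k for k in node_list}
```

Note that threads are used here purely to illustrate the split-and-merge pattern; for this CPU-bound pure-Python loop they will not actually run the chunks in parallel, which again suggests the set conversion, not parallelism, is where the real win is.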