Using dask.bag vs. a plain Python list?

Time: 2018-09-11 19:56:15

Tags: python parallel-processing dask

When I run the parallel dask.bag code below, the computation seems much slower than the sequential Python code. Any insight into why?

import dask.bag as db

def is_even(x):
    return not x % 2

Dask code:

%%timeit
b = db.from_sequence(range(2000000))
c = b.filter(is_even).map(lambda x: x ** 2)
c.compute()

>>> 12.8 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With n = 8000000
>>> 50.7 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python code:

%%timeit
b = list(range(2000000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 2, b))

>>> 547 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With n = 8000000
>>> 2.25 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1 Answer:

Answer 0 (score: 1)

Thanks to @abarnert for suggesting that I probe the overhead by making the tasks take longer.

It seems each task's duration was too short, and the scheduling overhead made Dask slower overall. I changed the exponent from 2 to 10000 to lengthen each task, and this example produced the result I expected:
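Another way to amortize that per-task overhead, without changing the workload itself, is to put more elements into each partition so each scheduled task does more work. This is only a sketch (not part of the original answer), using `dask.bag.from_sequence`'s documented `partition_size` parameter; the specific partition size is an arbitrary choice for illustration:

```python
import dask.bag as db

def is_even(x):
    return not x % 2

# Larger partitions mean fewer, longer tasks, so scheduler overhead
# is spread over more actual work per task.
b = db.from_sequence(range(2_000_000), partition_size=200_000)
c = b.filter(is_even).map(lambda x: x ** 2)
result = c.compute()
```

Whether this closes the gap with the plain-list version depends on how much real work each element represents; for an operation as cheap as squaring an integer, pure Python may still win.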

Python code:

%%timeit
b = list(range(50000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 10000, b))

>>> 34.8 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dask code:

%%timeit
b = db.from_sequence(range(50000))
c = b.filter(is_even).map(lambda x: x ** 10000)
c.compute()

>>> 26.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
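When comparing timings like these, it can also help to separate "parallel speedup" from "scheduling overhead" by running the same graph under different schedulers via the documented `scheduler=` keyword of `compute`. A minimal sketch (the element counts here are arbitrary, chosen only for illustration):

```python
import dask.bag as db

b = db.from_sequence(range(5000))
c = b.filter(lambda x: not x % 2).map(lambda x: x ** 10000)

# 'synchronous' runs the whole graph in a single thread: a baseline
# that includes graph construction but no parallelism.
baseline = c.compute(scheduler="synchronous")

# 'processes' runs tasks in a process pool, which can help for
# CPU-bound pure-Python work like big-integer exponentiation,
# at the cost of serializing data between processes.
parallel = c.compute(scheduler="processes")
```

If the synchronous run is already close to the plain-Python timing, the remaining difference when parallelizing is mostly scheduler and serialization overhead rather than the computation itself.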