Question

Hi all, can someone tell me why an if statement on a dask array is so slow, and how to fix it?

import dask.array as da
import time

x = da.random.binomial(1, 0.5, size=200, chunks=200)  # 200 elements in a single chunk
s = time.time()
if da.any(x):
    e = time.time()
    print('duration = ', e-s)

output: duration =  0.368

Answer 1

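Dask arrays are lazy by default, so no work happens until you call .compute() on the array.

In your case, you are implicitly calling .compute() when you put the dask array in the if statement, because the if statement converts it to a boolean.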

x = da.random.random(...)  # this is free
y = x + x.T  # this is free
z = y.any()  # this is free

if z:  # everything above is computed here, when z is converted to a bool
    ...
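
To make the laziness visible, here is a minimal sketch that times graph construction separately from the computation. The shape and chunk sizes are my own assumptions for illustration (the snippet above elides them), and the timings will vary by machine, so none are quoted.

```python
import time
import dask.array as da

# Shape and chunks are assumed values, chosen only for illustration.
x = da.random.random((1000, 1000), chunks=(250, 250))

t0 = time.perf_counter()
z = (x + x.T).any()          # still lazy: this only builds a task graph
build_time = time.perf_counter() - t0

t0 = time.perf_counter()
result = bool(z.compute())   # the actual work happens here
compute_time = time.perf_counter() - t0

print("graph construction:", build_time)
print("computation:", compute_time)
```

Writing `if z.compute():` instead of `if z:` does the same work, but it makes the point where the computation runs explicit.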

Answer 2

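I took a look at the dask source code. Essentially, when you call a function like this on a dask array, it performs a "reduction" over the array. Intuitively this is necessary because, behind the scenes, a dask array is stored as separate "blocks" that can each live in memory, on disk, and so on, and you need some way to pull the pieces together for the function call.

So the time you are seeing is the initial overhead of performing the reduction. Note that if you increase the size of the array to 2M, it takes about the same time as it does for 200. In the 20M case it takes only about 1 s.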
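
As a rough picture of what a block-wise reduction means, here is a conceptual sketch in plain NumPy. It is not dask's actual implementation; `blockwise_any` is a hypothetical helper that only shows the two steps: reduce each block on its own, then combine the partial results.

```python
import numpy as np

def blockwise_any(blocks):
    """Hypothetical helper: reduce each block, then combine the partial results."""
    # Step 1: each block is reduced independently (this part could run in parallel).
    partials = [bool(np.any(block)) for block in blocks]
    # Step 2: the per-block results are combined into one answer.
    return any(partials)

data = np.array([0, 0, 0, 0, 1, 0, 0, 0])
blocks = np.array_split(data, 4)   # four blocks of two elements each
print(blockwise_any(blocks))       # True, because the third block contains a 1
```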

import dask.array as da
import time

# 200 case
x = da.random.binomial(1, 0.5, size=200, chunks=200)
print(x.shape)
s = time.time()
print("start")
if da.any(x):  # converting to bool triggers the computation
    e = time.time()
    print('duration = ', e - s)

# duration =  0.362557172775


# 2M case
x = da.random.binomial(1, 0.5, size=2000000, chunks=2000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)

# duration =  0.132781982422

# 20M case
x = da.random.binomial(1, 0.5, size=20000000, chunks=20000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)

# duration =  1.08430886269


# 200M case
x = da.random.binomial(1, 0.5, size=200000000, chunks=200000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)

# duration =  8.83682179451
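
If you want to check whether the cost on a tiny array is mostly fixed overhead, one option is to time the same reduction several times and to try dask's single-threaded scheduler. This is only a sketch: `timed_any` is a hypothetical helper of mine, and whether `scheduler="synchronous"` actually helps in your setup is something to measure rather than assume.

```python
import time
import dask.array as da

x = da.random.binomial(1, 0.5, size=200, chunks=200)

def timed_any(arr, **compute_kwargs):
    """Hypothetical helper: run arr.any() and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = bool(arr.any().compute(**compute_kwargs))
    return result, time.perf_counter() - start

first, t_first = timed_any(x)     # may include one-time startup cost
second, t_second = timed_any(x)   # same graph, run again
sync, t_sync = timed_any(x, scheduler="synchronous")  # single-threaded scheduler

print("first call: ", t_first)
print("second call:", t_second)
print("synchronous:", t_sync)
```

If the second call is much faster than the first, the delay is mostly a one-time cost and not something the if statement itself adds.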

If statement over a dask array
