Question

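I am trying to parallelize the following operation with CuPy: I have an array. For each column of that array, I generate two random vectors. I take the column, add one vector, subtract the other, and store the result as the next column of the array. I continue until the array is filled.

I have already asked a related question, "Cupy slower than numpy when iterating through array". This one is different, because I believe I followed the advice given there: I parallelized the operation, I use one "for loop" instead of two, and I iterate only over the array's columns rather than over both rows and columns.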

import cupy as cp
import time
#import numpy as cp


def row_size(array):
    return(array.shape[1])

def number_of_rows(array):
    return(array.shape[0])

x = (cp.zeros((200,200), 'f'))
#x = cp.zeros((200,200))

x[:,1] = 500000

vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    if sum(x[ :, i])!=0:
        vector_one[ :, i + 1], vector_two[ :, i+ 1] = cp.random.poisson(.01*x[:,i],len(x[:,i])), cp.random.poisson(.01 * x[:,i],len(x[:,i]))
        x[ :, i+ 1] = x[ :, i] + vector_one[ :, i+ 1] - vector_two[ :, i+ 1]

t = time.time() - start  # renamed from `time` so the time module is not shadowed
print(x)
print(t)

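When I run this with CuPy, it takes about 0.62 seconds.

When I switch to NumPy, that is, when I 1) uncomment #import numpy as cp and #x = cp.zeros((200,200)) and 2) comment out import cupy as cp and x = (cp.zeros((200,200), 'f')) instead:

it takes about 0.11 seconds.

I thought that if I increased the array size, for example from (200,200) to (2000,2000), I might see CuPy come out faster, but it is still slower.

In one sense I know this is working, because if I change the coefficient in cp.random.poisson from .01 to .5, I can only do that in CuPy; the lambda is too large for NumPy.

But still, how do I make it faster with CuPy?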

Answer 1

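In general, looping on the host (CPU) and iteratively processing small device (GPU) arrays isn't ideal, because you have to launch many more separate kernels than you would with a non-looping, column-oriented approach. However, sometimes a column-oriented approach just isn't feasible.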
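As a rough illustration of the difference between the two approaches, the sketch below adds a vector to every column of an array, first with a host-side loop and then with a single broadcast operation. It uses NumPy as a stand-in so that it runs without a GPU; the names `data` and `offset` are made up for this example. With CuPy, each per-column operation in the loop would launch its own kernel(s), while the broadcast version issues one whole-array operation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 200))   # hypothetical input array
offset = rng.random(200)        # hypothetical vector added to every column

# Looping: one small array operation per column (200 separate operations).
looped = np.empty_like(data)
for j in range(data.shape[1]):
    looped[:, j] = data[:, j] + offset

# Column-oriented: a single broadcast operation over the whole array.
vectorized = data + offset[:, None]

print(np.allclose(looped, vectorized))
```

This only works when the columns are independent of each other. In the question, each column is computed from the previous one, so the loop over columns cannot be removed this way.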

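You can speed up your CuPy code by using CuPy's sum instead of Python's built-in sum, which forces a device-to-host transfer each time you call it. That said, you can also speed up your NumPy code by switching to NumPy's sum.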
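To see the two kinds of sum side by side, here is a small sketch that uses NumPy as a stand-in so it runs without a GPU. The array `col` and the timing variables are made up for this example. Python's built-in sum walks the array one element at a time, while the array's own sum method performs a single reduction; timings will vary by machine, so none are quoted here.

```python
import time

import numpy as np

col = np.ones(200_000, dtype="f")  # hypothetical column of values

start = time.perf_counter()
builtin_total = sum(col)           # Python-level loop, one element at a time
builtin_seconds = time.perf_counter() - start

start = time.perf_counter()
method_total = col.sum()           # one array-level reduction
method_seconds = time.perf_counter() - start

print(builtin_total, method_total)
print(f"built-in sum: {builtin_seconds:.4f} s, array sum: {method_seconds:.4f} s")
```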

import cupy as cp
import time
#import numpy as cp


def row_size(array):
    return(array.shape[1])

def number_of_rows(array):
    return(array.shape[0])

x = (cp.zeros((200,200), 'f'))
#x = cp.zeros((200,200))

x[:,1] = 500000

vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
#     if sum(x[ :, i]) !=0:
    if x[ :, i].sum() !=0: # or you could do: if x[ :, i].sum().get() !=0:
        vector_one[ :, i + 1], vector_two[ :, i+ 1] = cp.random.poisson(.01*x[:,i],len(x[:,i])), cp.random.poisson(.01 * x[:,i],len(x[:,i]))
        x[ :, i+ 1] = x[ :, i] + vector_one[ :, i+ 1] - vector_two[ :, i+ 1]

cp.cuda.Device().synchronize() # CuPy is asynchronous, but this doesn't really affect the timing here.

t = time.time() - start      
print(x)
print(t)
[[     0. 500000. 500101. ... 498121. 497922. 497740.]
 [     0. 500000. 499894. ... 502050. 502174. 502112.]
 [     0. 500000. 499989. ... 501703. 501836. 502081.]
 ...
 [     0. 500000. 499804. ... 499600. 499526. 499371.]
 [     0. 500000. 499923. ... 500371. 500184. 500247.]
 [     0. 500000. 500007. ... 501172. 501113. 501254.]]
0.06389498710632324

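This small change should make your workflow much faster (0.06 vs. 0.6 seconds on my T4 GPU). Note that the .get() method in the comment explicitly transfers the result of the sum operation from the GPU to the CPU before the not-equal comparison. This isn't necessary, because CuPy knows how to handle logical operations, but it would give you a very small additional speedup.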
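One way to compare the two libraries without commenting imports in and out is to pass the array module in as a parameter. The sketch below is an assumption-laden rewrite of the loop, not the answer's code: `simulate`, `xp`, and `rate` are hypothetical names, and the intermediate vector_one and vector_two arrays are not stored. It is run with NumPy here so that it works without a GPU; passing `cupy` as `xp` together with a CuPy array is expected to behave the same way, given that the code above makes the same cp.random.poisson calls.

```python
import numpy as np


def simulate(x, xp=np, rate=0.01):
    """Fill each column of x from the previous one, in place.

    xp is the array module to draw random numbers from (numpy by default;
    cupy is assumed to work as a drop-in replacement).
    """
    n_cols = x.shape[1]
    for i in range(n_cols - 1):
        col = x[:, i]
        # Use the array's own sum, not Python's built-in sum.
        if float(col.sum()) != 0:
            lam = rate * col
            gains = xp.random.poisson(lam, len(col))
            losses = xp.random.poisson(lam, len(col))
            x[:, i + 1] = col + gains - losses
    return x


x = np.zeros((8, 8), "f")
x[:, 1] = 500000
result = simulate(x)
print(result.shape)
```

The loop here runs over x.shape[1], the number of columns, since i is used as a column index; for the square arrays in the question this makes no difference.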

CuPy is slower than NumPy when running a "for loop" of vectors over the columns of an array
