Summing vector values with CUDA Python

Time: 2018-05-22 07:43:47

Tags: python vector cuda sum numba

I am trying to sum the values of many vectors using CUDA Python. I found a solution here that uses shared memory. Is there a way to do this without shared memory [since shared memory is quite small]? My vector sizes are:

N = 1000
i = 300000
v[i] = [1,2,..., N]

As the result I need to get:

out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
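
For reference, here is a minimal NumPy sketch of the desired result (my own illustration, assuming the vectors are stacked row-wise into a single 2-D array; NV plays the role of i above):

import numpy as np

N = 1000                                  # vector length
NV = 300000                               # number of vectors
v = np.ones((NV, N), dtype=np.float32)    # v[i] is the i-th vector
out = v.sum(axis=1)                       # out[i] == sum(v[i])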

Thanks for any suggestions :)

1 Answer:

Answer 0 (score: 1)

To perform many reductions at once, and for the problem dimensions you indicate, it matters whether your vectors are stored row-wise or column-wise.

For the row-wise storage method, a block-per-vector parallel reduction should be quite fast. Each block performs a standard sweep-based parallel reduction on a single vector, then writes the result to the output as a single number.

For the column-wise storage method, given the problem dimensions you indicate (in particular the "large" number of vectors), it is efficient to have each thread perform the reduction of one vector using a simple loop that walks down the column.

Here is a worked example of both methods:

# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32

#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # load shared memory with vector using a block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
    # perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]

#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp

#perform row-test
rvecs  = np.ones((NV, N), dtype=np.float32)
sums   = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])

#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.20%  1.12464s         2  562.32ms  557.25ms  567.39ms  [CUDA memcpy HtoD]
                    0.59%  6.6881ms         1  6.6881ms  6.6881ms  6.6881ms  cudapy::__main__::vec_sum_row$241(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.20%  2.2250ms         1  2.2250ms  2.2250ms  2.2250ms  cudapy::__main__::vec_sum_col$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.02%  212.83us         2  106.42us  104.45us  108.38us  [CUDA memcpy DtoH]
      API calls:   73.60%  1.12571s         2  562.85ms  557.77ms  567.94ms  cuMemcpyHtoD
                   25.30%  386.91ms         1  386.91ms  386.91ms  386.91ms  cuDevicePrimaryCtxRetain
                    0.64%  9.8042ms         2  4.9021ms  2.6113ms  7.1929ms  cuMemcpyDtoH
                    0.23%  3.4945ms         3  1.1648ms  182.38us  1.6636ms  cuMemAlloc
                    0.07%  999.98us         2  499.99us  62.409us  937.57us  cuLinkCreate
                    0.04%  678.12us         2  339.06us  331.01us  347.12us  cuModuleLoadDataEx
                    0.03%  458.51us         1  458.51us  458.51us  458.51us  cuMemGetInfo
                    0.03%  431.28us         4  107.82us  98.862us  120.58us  cuDeviceGetName
                    0.03%  409.59us         2  204.79us  200.33us  209.26us  cuLinkAddData
                    0.03%  393.75us         2  196.87us  185.18us  208.56us  cuLinkComplete
                    0.01%  218.68us         2  109.34us  79.726us  138.96us  cuLaunchKernel
                    0.00%  14.052us         3  4.6840us     406ns  11.886us  cuDeviceGetCount
                    0.00%  13.391us        12  1.1150us     682ns  1.5910us  cuDeviceGetAttribute
                    0.00%  13.207us         8  1.6500us  1.0110us  3.1970us  cuDeviceGet
                    0.00%  6.6800us        10     668ns     366ns  1.6910us  cuFuncGetAttribute
                    0.00%  6.3560us         1  6.3560us  6.3560us  6.3560us  cuCtxPushCurrent
                    0.00%  4.1940us         2  2.0970us  1.9810us  2.2130us  cuModuleGetFunction
                    0.00%  4.0220us         4  1.0050us     740ns  1.7010us  cuDeviceComputeCapability
                    0.00%  2.5810us         2  1.2900us  1.1740us  1.4070us  cuLinkDestroy
#

If you have a choice of storage method, column-wise storage is preferable for performance. In the example above, the row-sum kernel takes about 6.7 ms, while the column-sum kernel takes about 2.2 ms. The row-wise approach could probably be improved somewhat by launching a smaller number of blocks and having each block perform several reductions in a loop (see the sketch below), but it is unlikely to become faster than the column-wise approach.
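
For illustration, here is a hedged sketch of that variant (my own code, not part of the original answer): each block walks over the vectors with a block-stride loop, reusing its shared-memory buffer for one sweep reduction per vector. The grid size blocks is a hypothetical tuning parameter.

from numba import cuda, float32
import numpy as np

N = 1000                 # vector length
NV = 300000              # number of vectors
threadsperblock = 256    # must be a power of 2
blocks = 160             # hypothetical fixed grid size, tuned per GPU

@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_multi(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    vec = cuda.blockIdx.x
    # block-stride loop: each block reduces several vectors in turn
    while vec < NV:
        # accumulate this vector into shared memory
        lid = tid
        sm[tid] = 0
        while lid < N:
            sm[tid] += vecs[vec, lid]
            lid += bdim
        cuda.syncthreads()
        # standard shared-memory sweep reduction
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[vec] = sm[0]
        cuda.syncthreads()   # make sm safe to reuse for the next vector
        vec += cuda.gridDim.x

# launched the same way as before, but with a fixed number of blocks:
# vec_sum_row_multi[blocks, threadsperblock](d_rvecs, d_sums)

This reduces per-vector launch overhead, but each vector still requires its own shared-memory sweep, which is why the column-wise kernel tends to remain ahead.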

Note that this code needs roughly 1.5 GB of storage for each test (row and column), so it will not run as-is on a GPU with a very small amount of memory (e.g. a 2 GB GPU). You could make it run on a small-memory GPU by, for example, performing only the row test or only the column test, or by reducing the number of vectors.