I am trying to sum the values of many vectors using CUDA Python. I found a solution that uses shared memory here. Is there a way to do this without shared memory [because shared memory only has a small amount of storage]? My vector sizes are:
N = 1000
i = 300000
v[i] = [1,2,..., N]
As the result I need to get:
out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
Thanks for any suggestions :)
Answer 0 (score: 1)
To perform many reductions at once, and for the problem dimensions you indicate, it matters whether your vectors are stored row-wise or column-wise.
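To make that distinction concrete, here is a small NumPy sketch (not part of the benchmark code below; the array and result names are just for illustration, and building both arrays needs a few GB of host RAM) showing the two layouts and the reference result each kernel should reproduce:

import numpy as np

N = 1000       # vector length
NV = 300000    # number of vectors

# row-wise: each of the NV rows is one vector of length N
rvecs = np.ones((NV, N), dtype=np.float32)
row_sums = rvecs.sum(axis=1)   # shape (NV,) -- one sum per vector

# column-wise: each of the NV columns is one vector of length N
cvecs = np.ones((N, NV), dtype=np.float32)
col_sums = cvecs.sum(axis=0)   # shape (NV,) -- one sum per vector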
For the row-wise storage approach, a block-per-vector parallel reduction should be quite fast. Each block performs a standard sweep-based parallel reduction on a single vector and then writes the result to the output as a single number.
For the column-wise storage approach, given the problem dimensions you indicate (in particular the "large" number of vectors), it is efficient to have each thread reduce one vector with a simple loop that strides down its column.
Here are worked examples of both methods:
# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32

#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256

#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # load shared memory with vector using block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
    # perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]

#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp

#perform row-wise test
rvecs = np.ones((NV, N), dtype=np.float32)
sums = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])

#perform column-wise test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.20% 1.12464s 2 562.32ms 557.25ms 567.39ms [CUDA memcpy HtoD]
0.59% 6.6881ms 1 6.6881ms 6.6881ms 6.6881ms cudapy::__main__::vec_sum_row$241(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.20% 2.2250ms 1 2.2250ms 2.2250ms 2.2250ms cudapy::__main__::vec_sum_col$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.02% 212.83us 2 106.42us 104.45us 108.38us [CUDA memcpy DtoH]
API calls: 73.60% 1.12571s 2 562.85ms 557.77ms 567.94ms cuMemcpyHtoD
25.30% 386.91ms 1 386.91ms 386.91ms 386.91ms cuDevicePrimaryCtxRetain
0.64% 9.8042ms 2 4.9021ms 2.6113ms 7.1929ms cuMemcpyDtoH
0.23% 3.4945ms 3 1.1648ms 182.38us 1.6636ms cuMemAlloc
0.07% 999.98us 2 499.99us 62.409us 937.57us cuLinkCreate
0.04% 678.12us 2 339.06us 331.01us 347.12us cuModuleLoadDataEx
0.03% 458.51us 1 458.51us 458.51us 458.51us cuMemGetInfo
0.03% 431.28us 4 107.82us 98.862us 120.58us cuDeviceGetName
0.03% 409.59us 2 204.79us 200.33us 209.26us cuLinkAddData
0.03% 393.75us 2 196.87us 185.18us 208.56us cuLinkComplete
0.01% 218.68us 2 109.34us 79.726us 138.96us cuLaunchKernel
0.00% 14.052us 3 4.6840us 406ns 11.886us cuDeviceGetCount
0.00% 13.391us 12 1.1150us 682ns 1.5910us cuDeviceGetAttribute
0.00% 13.207us 8 1.6500us 1.0110us 3.1970us cuDeviceGet
0.00% 6.6800us 10 668ns 366ns 1.6910us cuFuncGetAttribute
0.00% 6.3560us 1 6.3560us 6.3560us 6.3560us cuCtxPushCurrent
0.00% 4.1940us 2 2.0970us 1.9810us 2.2130us cuModuleGetFunction
0.00% 4.0220us 4 1.0050us 740ns 1.7010us cuDeviceComputeCapability
0.00% 2.5810us 2 1.2900us 1.1740us 1.4070us cuLinkDestroy
#
If you have a choice of storage method, column-wise storage is preferred for performance. In the example above, the row-sum kernel takes about 6.7 ms, while the column-sum kernel takes about 2.2 ms. The row-wise approach above could probably be improved by launching a smaller number of blocks and having each block perform multiple reductions in a loop (a rough sketch of that idea appears at the end of this answer), but it is unlikely to beat the column-wise method.
Note that this code requires roughly 1.5 GB of storage for each test case (row-wise and column-wise), so it will not run as-is on a GPU with a very small amount of memory (e.g. a 2 GB GPU). You could make it run on a small-memory GPU by, for example, running only the row test or only the column test, or by reducing the number of vectors.
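For completeness, here is a minimal, untested sketch of the improvement mentioned above: launch far fewer blocks and let each block reduce several vectors by striding over the rows. It assumes the same N, NV, threadsperblock and device arrays as the listing above; the kernel name vec_sum_row_looped and the block count of 1024 are placeholders you would tune, not something from the original benchmark:

#improved row-wise kernel: each block loops over multiple vectors (grid-stride over rows)
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_looped(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    row = cuda.blockIdx.x
    while row < NV:                      # grid-stride loop over vectors
        # load this row into shared memory with a block-stride loop
        sm[tid] = 0
        lid = tid
        while lid < N:
            sm[tid] += vecs[row, lid]
            lid += bdim
        cuda.syncthreads()
        # standard sweep reduction in shared memory
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[row] = sm[0]
        cuda.syncthreads()               # make sm safe to reuse for the next row
        row += cuda.gridDim.x

#example launch with a fixed (assumed) number of blocks instead of one block per vector
#vec_sum_row_looped[1024, threadsperblock](d_rvecs, d_sums)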