I ran C and Python code that add two arrays on a GPU, but I found that the Python code runs about 100x faster than the C code.
Here is my code:
@cuda.jit Python:
import sys
import time
import numpy as np
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:])')
def cu_add(a, b, c):
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    tx = cuda.threadIdx.x
    i = tx + bx * bw
    # guard against out-of-range threads in the last block
    if i >= c.size:
        return
    c[i] = a[i] + b[i]

def main(num):
    device = cuda.get_current_device()

    # host memory
    a = np.full(num, 1.0, dtype=np.float32)
    b = np.full(num, 1.0, dtype=np.float32)

    # create device memory
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(a)

    #tpb = device.WARP_SIZE
    tpb = 1024
    bpg = int(np.ceil(float(num) / tpb))
    print 'Blocks per grid:', bpg
    print 'Threads per block:', tpb

    # launch kernel
    st = time.time()
    cu_add[bpg, tpb](d_a, d_b, d_c)
    et = time.time()
    print 'Time taken', (et - st), 'seconds'

    c = d_c.copy_to_host()
    for i in xrange(1000):
        if c[i] != 2.0:
            raise Exception('incorrect result at index %d' % i)

if __name__ == "__main__":
    main(int(sys.argv[1]))
Run: python numba_vec_add_float.py 697932185
Output: Blocks per grid: 681575, Threads per block: 1024, Time taken 0.000330924987793 seconds

CUDA C:
Compile: nvcc --gpu-architecture=compute_61 nvidia_vector_addition.cu
Run: ./a.out
Output: Blocks per grid: 681575, Threads per block: 1024, GPU time -> 34.359295 ms, GPU time -> 0.034359 seconds
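The question doesn't reproduce nvidia_vector_addition.cu itself. For reference, a minimal sketch of a comparable CUDA C vector add timed with CUDA events, which is what the "GPU time" output suggests; this is a hypothetical reconstruction, and the kernel name and structure are assumptions, not the asker's actual file:

// Hypothetical reconstruction of nvidia_vector_addition.cu, not the asker's file.
// Host-side initialization and result checking are omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 697932185;           // same element count as the Python run
    const int tpb = 1024;
    const int bpg = (n + tpb - 1) / tpb;

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    printf("Blocks per grid: %d\n", bpg);
    printf("Threads per block: %d\n", tpb);

    // Events bracket the kernel, and cudaEventSynchronize waits for it to
    // finish, so the elapsed time covers the whole kernel execution.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    vec_add<<<bpg, tpb>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time -> %f ms\n", ms);
    printf("GPU time -> %f seconds\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}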
From these numbers, the @cuda.jit Python version appears to be about 103x faster than the CUDA C version. Can anyone clarify whether my approach is right or wrong?
Answer 0 (score: 3):
In the numba case you are measuring only the kernel launch overhead, not the full time it takes the kernel to run: kernel launches are asynchronous, so control returns to the host (and your timer stops) before the kernel has finished. In the CUDA C case you are measuring the full time it takes the kernel to run.
To make the numba case take a measurement comparable to the CUDA C case, try this modification:
#launch kernel
mystream = cuda.stream()
st = time.time()
cu_add[bpg, tpb, mystream](d_a, d_b, d_c)
# block until all work queued on the stream has finished, so the timer
# captures the full kernel execution rather than just the launch
mystream.synchronize()
et = time.time()
(From here.)
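Equivalently, one could keep the default stream and simply synchronize the device before stopping the timer. A minimal sketch of that variant, using numba's cuda.synchronize(); this is not from the original answer:

#launch kernel on the default stream
st = time.time()
cu_add[bpg, tpb](d_a, d_b, d_c)
# wait for all queued GPU work in the current context to complete
cuda.synchronize()
et = time.time()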