I ran C and Python code that add two arrays on a GPU, but I found that the Python code runs about 100x faster than the C code.
Here is my code:
@cuda.jit Python:
import sys
import time
import numpy as np
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:])')
def cu_add(a, b, c):
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    tx = cuda.threadIdx.x
    i = tx + bx * bw
    # guard against out-of-range threads in the last block
    if i >= c.size:
        return
    c[i] = a[i] + b[i]

def main(num):
    device = cuda.get_current_device()

    # host memory
    a = np.full(num, 1.0, dtype=np.float32)
    b = np.full(num, 1.0, dtype=np.float32)

    # create device memory
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(a)

    #tpb = device.WARP_SIZE
    tpb = 1024
    bpg = int(np.ceil(float(num) / tpb))
    print 'Blocks per grid:', bpg
    print 'Threads per block:', tpb

    # launch kernel
    st = time.time()
    cu_add[bpg, tpb](d_a, d_b, d_c)
    et = time.time()
    print 'Time taken', (et - st), 'seconds'

    c = d_c.copy_to_host()
    for i in xrange(1000):
        if c[i] != 2.0:
            raise Exception('incorrect result at index %d' % i)

if __name__ == "__main__":
    main(int(sys.argv[1]))
Run: python numba_vec_add_float.py 697932185
Output: Blocks per grid: 681575, Threads per block: 1024, Time taken 0.000330924987793 seconds

CUDA C:
Compile: nvcc --gpu-architecture=compute_61 nvidia_vector_addition.cu
Run: ./a.out
Output: Blocks per grid: 681575, Threads per block: 1024, GPU time -> 34.359295 ms, GPU time -> 0.034359 seconds
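The question doesn't reproduce nvidia_vector_addition.cu itself. For reference, a minimal sketch of a comparable CUDA C vector add timed with CUDA events, which is what the "GPU time" output suggests; this is a hypothetical reconstruction, and the kernel name and structure are assumptions, not the asker's actual file:

// Hypothetical reconstruction of nvidia_vector_addition.cu, not the asker's file.
// Host-side initialization and result checking are omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 697932185;           // same element count as the Python run
    const int tpb = 1024;
    const int bpg = (n + tpb - 1) / tpb;

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    printf("Blocks per grid: %d\n", bpg);
    printf("Threads per block: %d\n", tpb);

    // Events bracket the kernel, and cudaEventSynchronize waits for it to
    // finish, so the elapsed time covers the whole kernel execution.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    vec_add<<<bpg, tpb>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time -> %f ms\n", ms);
    printf("GPU time -> %f seconds\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}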
From these numbers, the @cuda.jit Python version appears to be about 103x faster than the CUDA C version. Can anyone clarify whether my approach is right or wrong?
Answer 0 (score: 3):
In the numba case you are measuring only the kernel launch overhead, not the full time it takes the kernel to run: kernel launches are asynchronous, so control returns to the host (and your timer stops) before the kernel has finished. In the CUDA C case you are measuring the full time it takes the kernel to run.
To make the numba case take a measurement comparable to the CUDA C case, try this modification:
#launch kernel
mystream = cuda.stream()
st = time.time()
cu_add[bpg, tpb, mystream](d_a, d_b, d_c)
# block until all work queued on the stream has finished, so the timer
# captures the full kernel execution rather than just the launch
mystream.synchronize()
et = time.time()
(From here.)
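Equivalently, one could keep the default stream and simply synchronize the device before stopping the timer. A minimal sketch of that variant, using numba's cuda.synchronize(); this is not from the original answer:

#launch kernel on the default stream
st = time.time()
cu_add[bpg, tpb](d_a, d_b, d_c)
# wait for all queued GPU work in the current context to complete
cuda.synchronize()
et = time.time()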