Question

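Numba lacks the CUDA C gridsync() command, so there is no canned method for synchronising across an entire grid. Only block-level synchronisation is available.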

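If cudaKernel1 has a very short execution time, the following loop could run 1000x faster than it does when the kernel is launched once per iteration:

for i in range(10000):
    cudaKernel1[(100, 100), (32, 32)](X)  # the kernel updates X in place

The speed-up would come from putting the loop inside the kernel itself, which avoids the GPU kernel launch overhead. But you cannot do that, because every block in the grid must finish one iteration before the next iteration starts, and Numba has no gridsync() command.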
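To make the 1000x claim concrete, here is a rough back-of-the-envelope model in plain Python. The launch and execution times are made-up illustrative figures, not measurements, and `total_time` is a hypothetical helper, so treat this only as a sketch of the arithmetic.

```python
def total_time(n_steps, launch_s, exec_s, launches):
    """Total wall time for n_steps of work spread over `launches` kernel launches."""
    return launches * launch_s + n_steps * exec_s

# Hypothetical figures, not measurements: 10 us per launch, 10 ns of work per step.
N_STEPS = 10_000
LAUNCH_S = 10e-6
EXEC_S = 10e-9

# One kernel launch per loop iteration versus a single launch that loops internally.
one_launch_per_step = total_time(N_STEPS, LAUNCH_S, EXEC_S, launches=N_STEPS)
single_launch = total_time(N_STEPS, LAUNCH_S, EXEC_S, launches=1)
speedup = one_launch_per_step / single_launch
print(f"speed-up from looping inside the kernel: {speedup:.0f}x")
```

With these assumed figures the launch overhead dominates, so the ratio lands in the same order of magnitude as the 1000x quoted above.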

Here is an obvious way to do a gridsync() in Numba, so you would think people would be using it, but I cannot find any examples of it.

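However, I have found many comments on Stack Overflow saying, without explanation, that trying to use atomic counters to synchronise blocks across a grid is pointless, unsafe, or will deadlock because of race conditions. Instead they recommend exiting the kernel between the two steps. But if each step is very fast, launching a kernel takes longer than executing it, so looping over the steps without exiting could be 1000x faster.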

I cannot figure out what is unsafe about it, or why a race condition would be a pitfall here.

What is wrong with something like the following?

import numpy as np
from numba import cuda

@cuda.jit
def gpu_initGridSync(Global_u):
    # Reset both barrier counters; launched with a single thread.
    if cuda.threadIdx.x == 0:
        Global_u[0] = 0
        Global_u[1] = 0

@cuda.jit(device=True)
def gpu_fakeGridSync(Global_u, i):
    # Wait until the entire grid has finished doSomething().
    # In CUDA C we'd call gridsync(),
    # but we lack that in Numba, so do the following instead.

    # The counters are never reset. Called with i = 1, 3, 5, ... this is the
    # ((i + 1) // 2)-th sync, so each counter must reach that many full grids.
    target = ((i + 1) // 2) * cuda.gridDim.x

    # sync threads in the current block
    cuda.syncthreads()

    # increment global counter, once per block
    if cuda.threadIdx.x == 0:
        cuda.atomic.add(Global_u, 0, 1)

    # idle in a loop
    while Global_u[0] < target:  #2
        pass

    # regroup the block threads after the slow global memory reads
    cuda.syncthreads()

    # now, to avoid a race condition of blocks re-entering the above while
    # loop before other blocks have exited, we do this global sync a second time

    # increment global counter, once per block
    if cuda.threadIdx.x == 0:
        cuda.atomic.add(Global_u, 1, 1)

    # idle in a loop
    while Global_u[1] < target:  #2
        pass

    # regroup the block threads after the slow global memory reads
    cuda.syncthreads()

and then use it like this:

@cuda.jit(device=True)
def calculateSomething(X):  # A dummy example of a very fast kernel operation
    i = cuda.grid(1)
    if i > 0:
        return (X[i] - X[i-1]) / 2.0
    return 0.0

@cuda.jit
def ReallyReallyFast(X, Global_u):
    i = cuda.grid(1)
    for h in range(1, 40000, 4):
        temp = calculateSomething(X)
        gpu_fakeGridSync(Global_u, h)
        X[i] = X[i] + temp
        gpu_fakeGridSync(Global_u, h + 2)

# X is assumed to be a float32 device array with at least 1000 * 32 elements.
# The counters are passed in as a device array so the kernels can write to them.
Global_u = cuda.to_device(np.zeros(2, dtype=np.int32))
gpu_initGridSync[(1,), (1,)](Global_u)
ReallyReallyFast[(1000,), (32,)](X, Global_u)

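This seems logically sound to me. There is one subtle step: initialising the global counters. That has to be done in its own kernel call to avoid a race condition. After that I am free to keep calling gpu_fakeGridSync without reinitialising it. I do have to keep track of which loop iteration I am calling it from (hence the parameter passed to gpu_fakeGridSync).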

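I admit I can see some wasted effort, but is that a deal killer? For example, in statement #2, the while loop means that all threads in all finished blocks are spinning their wheels. I suppose this might slightly slow down the blocks that are still trying to execute "doSomething". I am not sure how bad that wasted effort is. A second nitpick about statement #2 is that all threads contend for the same global memory location, so their accesses will be slow. That might even be a good thing if it means the scheduler defers their execution and lets the useful threads run more often. This naive code could be improved by having only thread 0 in each block do the polling.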

Answer 1

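I think Robert Crovella's comment points to the correct explanation of why this method fails.

I was incorrectly assuming that the scheduler did preemptive multitasking, so that every block would get a time slice to run.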
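For contrast, that assumption does hold for preemptively scheduled CPU threads, which is probably why the counter barrier looks safe at first glance. Below is a minimal sketch using Python threads; `run_counter_barrier` is a hypothetical name, and the lock merely stands in for an atomic add. It only illustrates the scheduling assumption and says nothing about GPU behaviour.

```python
import threading
import time

def run_counter_barrier(n_threads=4, n_rounds=50):
    """Run n_rounds of a never-reset counter barrier; return the number of ordering violations."""
    counter = [0]                    # shared arrival counter
    lock = threading.Lock()          # stands in for an atomic add
    progress = [-1] * n_threads      # latest round each thread has started
    violations = [0]

    def worker(tid):
        for rnd in range(n_rounds):
            progress[tid] = rnd      # the "work" for this round
            with lock:
                counter[0] += 1      # announce arrival at the barrier
            # Spin until every thread has arrived for this round.
            while counter[0] < (rnd + 1) * n_threads:
                time.sleep(0)        # yield; the OS is free to run the other threads
            # Past the barrier, no thread may still be in an earlier round.
            if min(progress) < rnd:
                with lock:
                    violations[0] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations[0]
```

Every waiting thread here is eventually descheduled so that the late arrivals can run, which is exactly the guarantee a spin-wait barrier depends on.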

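Currently, Nvidia GPUs do not have preemptive multitasking schedulers. Jobs run to completion.

So once enough blocks have entered the while loop to wait, the scheduler will not start the remaining blocks, and the wait loop will wait forever.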
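The failure mode can be sketched with a toy model in plain Python. It assumes a scheduler with a fixed number of resident block slots that only starts a new block when a slot is freed; `barrier_completes` and the slot counts are hypothetical and are not taken from any real GPU.

```python
from collections import deque

def barrier_completes(n_blocks, resident_slots):
    """Model one spin-wait grid barrier under run-to-completion scheduling.

    Returns True if every block reaches the barrier, False if the grid deadlocks.
    """
    pending = deque(range(n_blocks))  # blocks the scheduler has not started yet
    spinning = set()                  # resident blocks stuck in the wait loop
    arrived = 0                       # the global atomic counter

    # The scheduler starts blocks only while a slot is free.
    while pending and len(spinning) < resident_slots:
        block = pending.popleft()
        arrived += 1                  # the block increments the counter...
        spinning.add(block)           # ...and then spins, holding its slot

    # Spinning blocks never exit, so no slot is ever freed for pending blocks.
    return arrived == n_blocks

# Hypothetical capacity: 160 blocks can be resident at once.
print(barrier_completes(100, 160))    # grid fits: the barrier opens
print(barrier_completes(1000, 160))   # grid too large: 840 blocks never start
```

In this model the barrier works only when the whole grid fits on the device at once, which matches the explanation above.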

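I see there are research papers suggesting how Nvidia could make its scheduler preemptive: https://www.computer.org/csdl/proceedings/snpd/2012/2120/00/06299288.pdf But evidently that is not the case right now.

I am left wondering how CUDA C manages to implement its gridsync() command. If it can be done in C, there must be some generic way to work around these limitations. This is a mystery that I hope someone will comment on below.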
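As far as the CUDA documentation describes it, grid-wide sync in CUDA C is only valid for kernels started with a cooperative launch, and such a launch is rejected if the grid cannot be fully resident on the device at once. One common way to live within that limit is to launch only as many blocks as fit and have each thread stride over several elements. The plain-Python sketch below shows just that index mapping; `grid_stride_indices` and the sizes are hypothetical.

```python
def grid_stride_indices(n_elements, grid_blocks, threads_per_block):
    """Return, for each global thread id, the element indices that thread would handle."""
    stride = grid_blocks * threads_per_block   # total threads in the (smaller) grid
    return [list(range(tid, n_elements, stride)) for tid in range(stride)]

# Hypothetical sizes: 32000 elements, but only 160 blocks of 32 threads fit at once.
work = grid_stride_indices(32000, 160, 32)

# Every element is still covered exactly once, just by fewer threads.
covered = sorted(i for per_thread in work for i in per_thread)
print(len(work), "threads cover", len(covered), "elements")
```

With the grid capped this way every block is resident for the whole kernel, so a grid-wide barrier has a chance to complete.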

It really is a shame to leave a 1000x speed-up on the table.

Is it safe to implement CUDA gridsync() in Numba this way?