Question

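I want to execute CUDA threads in sequence. For example,

in the figure above, I want the values indexed by [thread_id, j] to be fed in order, i.e., an entry is fed only after array[0,0], array[0,1], array[0,2], and so on have been given.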

The approach I can think of is to set up a global array and keep polling the value of array[0,3]. Once array[0,3] has been given, I can feed array[1,2].

However, this fails with the following code:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break

                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)


Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

The expected output is:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Another approach I can think of is to use cuda.syncthreads(). Here is the code:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # j starts from thread_id + 1
            j = thread_id + 1

            # loop through other elements in global_array
            for i in range(2*N-1):

                if i>2*thread_id:
                    if j<N:
                        cuda.atomic.add(array, (thread_id,j), 1)
                    j+=1
                    cuda.syncthreads()
                else:
                    cuda.syncthreads()

N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

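This certainly works. However, if global_array is larger than the number of GPU cores, then for thread_id > number of GPU cores the execution goes through many unnecessary syncthreads() calls. That is very time consuming!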

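At the same time, serialization across blocks is not possible this way, because cuda.syncthreads() only synchronizes the threads within a single block.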
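To put a rough number on that overhead, the loop of the syncthreads-based kernel can be replayed on the host. This is only a plain-Python sketch, not GPU code: the function name simulate_syncthreads_schedule is made up for illustration, and the threads are replayed one after another, which is fine here only because each thread writes to its own row.

```python
def simulate_syncthreads_schedule(n):
    """Replay the loop of the syncthreads-based kernel on the host.

    Returns (array, barriers, writes): the resulting n x n grid, the total
    number of barrier calls summed over all threads, and the number of
    useful writes.
    """
    array = [[0] * n for _ in range(n)]
    barriers = 0
    writes = 0
    for thread_id in range(n):
        j = thread_id + 1
        for i in range(2 * n - 1):
            if i > 2 * thread_id:
                if j < n:
                    array[thread_id][j] += 1
                    writes += 1
                j += 1
            # Both branches of the kernel end in cuda.syncthreads(),
            # so every loop iteration costs one barrier.
            barriers += 1
    return array, barriers, writes


array, barriers, writes = simulate_syncthreads_schedule(10)
print(barriers, writes)
```

For N = 10 every thread passes through 2*N - 1 = 19 barriers, while thread thread_id performs only N - 1 - thread_id writes, so most barrier calls do no useful work.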

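I have three questions:

Why does the first code above fail even though I use atomic operations?
Is there a cleverer way to achieve this?
For the first approach, how can I serialize across blocks?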

Answer 1

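Why does the above code fail?

Because the CUDA execution model does not guarantee any ordering of thread execution, so your assumptions about execution order will most likely never hold. In addition, the polling read of array[thread_id-1, j+1] in your code is a plain, non-atomic load, so the pseudo-spinlock you appear to be trying to implement cannot work reliably either.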

Is there a cleverer way to achieve this?

No. There is no way to impose an execution order in Numba CUDA in the manner you are asking for.
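One workaround that sidesteps in-kernel ordering altogether is to move the ordering to the host: kernel launches issued to the same stream run one after another, so each launch can handle one "step" of cells that do not depend on each other. The sketch below is not part of the original answer and contains no CUDA code; it only models such a step schedule in plain Python. The names wavefront_schedule and run_schedule, and the step index j + 2 * thread_id, are assumptions of mine, chosen so that cell [thread_id, j] falls exactly one step after [thread_id-1, j+1].

```python
def wavefront_schedule(n):
    """Group cells (thread_id, j) with j > thread_id into ordered steps.

    Assumed step index: j + 2 * thread_id. Moving from (t - 1, j + 1)
    to (t, j) raises this index by exactly 1, so a cell always lands
    one step after the cell it waits for.
    """
    steps = {}
    for t in range(n):
        for j in range(t + 1, n):
            steps.setdefault(j + 2 * t, []).append((t, j))
    return [steps[s] for s in sorted(steps)]


def run_schedule(n):
    """Apply the steps in order and check each dependency on the way."""
    array = [[0] * n for _ in range(n)]
    for cells in wavefront_schedule(n):
        # Cells inside one step are independent of each other; on a GPU
        # they could be written by one kernel launch, one thread per cell.
        for t, j in cells:
            if t > 0 and j < n - 1:
                # Same condition the question polls for.
                assert array[t - 1][j + 1] > 0
            array[t][j] += 1
    return array


result = run_schedule(10)
for row in result:
    print(row)
```

In a Numba version built on this idea, each inner list would correspond to one kernel launch that receives the step index as an argument. Since the ordering comes from the launch order rather than from barriers or spinning, it would also hold across blocks.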

Answer 2

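After some trial and error, it seems that the cuda.syncthreads() approach is not viable.

However, this can be done by setting up a global array, provided that we perform an atomic operation before retrieving any value: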

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break
                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            cuda.atomic.max(array, (thread_id-1,j+1), 0)
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

How can I execute code sequentially in CUDA through Numba?