How can I execute code sequentially in CUDA with numba?

Time: 2019-11-11 15:44:16

Tags: cuda gpu numba

I want CUDA threads to execute in order. For example: [figure]

In the figure above, I want the values at index [thread_id, j] to be filled sequentially, i.e. a value is filled only after array[0,0], array[0,1], array[0,2], and so on have already been filled.

The approach I can think of is to set up a global array and keep polling its values; for example, only once array[0,3] has been written can I fill in array[1,2].
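
To make the intended order concrete, here is a minimal sequential (CPU) sketch; it is only an illustration, and the dependency it encodes ([row, j] is filled only after [row - 1, j + 1]) is the same check used in the GPU code below:

import numpy as np

N = 10
array = np.zeros((N, N))
for row in range(N):
    for j in range(row + 1, N):
        # On the GPU, thread `row` would have to wait at this point until
        # thread `row - 1` has already written array[row - 1, j + 1].
        if row == 0 or j == N - 1 or array[row - 1, j + 1] > 0:
            array[row, j] += 1
print(array)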

However, this fails with the following code:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break

                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)


Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

The expected output should be:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Another approach I can think of is to use cuda.syncthreads(); the code is below:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # j starts from thread_id + 1
            j = thread_id + 1

            # loop through other elements in global_array
            for i in range(2*N-1):

                if i>2*thread_id:
                    if j<N:
                        cuda.atomic.add(array, (thread_id,j), 1)
                    j+=1
                    cuda.syncthreads()
                else:
                    cuda.syncthreads()

N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

This does work, of course. However, if the size of global_array is larger than the number of GPU cores, then threads with thread_id greater than the number of GPU cores spend many unnecessary passes in syncthreads(). That wastes a lot of time!

At the same time, serialization between blocks is not possible this way, since syncthreads() only synchronizes the threads within a single block.

I have three questions:

  1. Why does the first code above fail, even though I use atomic operations?
  2. Is there a better way to achieve this?
  3. For the first approach, how can execution be serialized across blocks?

2 answers:

Answer 0 (score: 1):

  1. Why does the code above fail?

Because the CUDA execution model makes no guarantees about the order in which threads run, your assumptions about execution order will most likely never hold. In addition, the reads in your code are ordinary, non-atomic memory transactions, so the pseudo-spinlock you appear to be trying to implement cannot work correctly either.
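
As a minimal illustration (a sketch, not code from the original answer): Numba's cuda.atomic.add returns the old value at the indexed location, so adding 0 behaves like an atomic load. This would make the read in the spin loop atomic, but it still would not impose any ordering on the threads.

import numpy as np
from numba import cuda

@cuda.jit
def atomic_read_demo(arr, out):
    i = cuda.grid(1)
    if i < arr.shape[0]:
        # cuda.atomic.add returns the value stored at the index before the
        # addition, so adding 0.0 acts as an atomic load of arr[i].
        out[i] = cuda.atomic.add(arr, i, 0.0)

arr = np.arange(8, dtype=np.float64)
out = np.zeros_like(arr)
atomic_read_demo[1, 8](arr, out)
print(out)  # same values as arr, each obtained through an atomic transaction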

  2. Is there a better way to achieve this?

No. There is no way to impose the execution order you are asking for in Numba CUDA.

Answer 1 (score: -2):

After repeated trial and error, it seems that the cuda.syncthreads() approach is not feasible.

However, it can be done by setting up a global array and performing an atomic operation before any read of its values:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break
                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            cuda.atomic.max(array, (thread_id-1,j+1), 0)
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

Output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
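
As a quick host-side sanity check, assuming the desired result is exactly the strictly upper-triangular pattern shown above, the kernel output can be compared against np.triu:

expected = np.triu(np.ones((N, N)), k=1)
print(np.array_equal(array, expected))  # True when every slot [i, j > i] was filled exactly once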