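I want the values at index [thread_id, j] to be written in order, i.e., only once array[0,0], array[0,1], array[0,2], etc. have been written.
The approach I can think of is to set up a global array and keep polling the value of array[0,3]. Once array[0,3] has been written, I can write array[1,2].
However, this fails with the following code: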
import math
import numpy as np
from numba import cuda
@cuda.jit
def keep_max(global_array,array):
    thread_id = cuda.grid(1)
    if thread_id<N:
        # loop through other elements in global_array
        for j in range(thread_id+1, N):
            # consistently read values from array
            for _ in range(1000): # or while True:
                # for thread_id == 0, just execute
                if thread_id==0:
                    cuda.atomic.add(array,(thread_id,j), 1)
                    break
                # for thread_id>0
                else:
                    # if j reaches the last number of global_array
                    # just execute
                    if j == N-1:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break
                    else:
                        # check if the previous thread_id, i.e., thread_id - 1
                        # finishes the execution of combination [thread_id-1,j+1]
                        if array[thread_id-1,j+1]>0:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
N = 10
global_array = np.arange(N)
array = np.zeros([N,N])
# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock
print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)
Output:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
The expected output is:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Another approach I can think of is to use cuda.syncthreads(). Here is the code:
import math
import numpy as np
from numba import cuda
@cuda.jit
def keep_max(global_array,array):
    thread_id = cuda.grid(1)
    if thread_id<N:
        # j starts from thread_id + 1
        j = thread_id + 1
        # loop through other elements in global_array
        for i in range(2*N-1):
            if i>2*thread_id:
                if j<N:
                    cuda.atomic.add(array, (thread_id,j), 1)
                    j+=1
                cuda.syncthreads()
            else:
                cuda.syncthreads()
N = 10
global_array = np.arange(N)
array = np.zeros([N,N])
# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock
print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)
Output:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
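This does work. However, if the size of global_array is larger than the number of GPU cores, then for every thread_id beyond the number of GPU cores the execution passes through many unnecessary syncthreads() calls. This is time-consuming!
At the same time, serialization between blocks is not possible.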
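To put a rough number on that overhead, the following sketch replays the loop structure of the syncthreads() kernel on the CPU and counts how often the active threads would reach a barrier compared with how many writes they perform. The name `barrier_stats` is made up for this illustration, and the function only counts calls; it says nothing about actual GPU timing.

```python
def barrier_stats(n):
    """Count barrier calls and writes for the n active threads.

    Mirrors the loop of the syncthreads() kernel: every active thread
    runs 2*n - 1 iterations and hits one barrier per iteration, but it
    only writes while i > 2*thread_id and j < n.
    """
    barriers = 0
    writes = 0
    for thread_id in range(n):
        j = thread_id + 1
        for i in range(2 * n - 1):
            if i > 2 * thread_id and j < n:
                writes += 1   # stands in for cuda.atomic.add(...)
                j += 1
            barriers += 1     # stands in for cuda.syncthreads()
    return barriers, writes


print(barrier_stats(10))  # (190, 45)
```

For N = 10 this counts 190 barrier calls for 45 writes, and the barrier count grows as N * (2N - 1).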
I have three questions:
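Answer 0 (score: 1)
- Why does the code above fail?
Because the CUDA execution model makes no guarantees about the order in which threads run, so your assumptions about execution order will in all likelihood never hold. In addition, all of the memory transactions in your code are non-atomic, so the pseudo-spinlock you seem to be trying to implement cannot work either.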
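The effect of an unfavourable schedule can be illustrated on the CPU. The sketch below is a simplified model, not a description of what the hardware actually does: it assumes each thread runs to completion in a caller-supplied order, and it reuses the bounded spin loop from the first kernel. The name `simulate` and its parameters are hypothetical.

```python
def simulate(n, thread_order, max_spins=1000):
    """Replay the first kernel's logic with threads run one at a time.

    thread_order is an assumed schedule: each thread id in it runs to
    completion before the next one starts. Real GPUs give no such
    guarantee in either direction.
    """
    array = [[0] * n for _ in range(n)]
    for t in thread_order:
        for j in range(t + 1, n):
            for _ in range(max_spins):
                # Same three cases as the kernel: thread 0, last column,
                # or the neighbour cell [t-1, j+1] already written.
                if t == 0 or j == n - 1 or array[t - 1][j + 1] > 0:
                    array[t][j] += 1
                    break
            # If the spin budget runs out, the cell is silently skipped.
    return array


for row in simulate(10, range(9, -1, -1)):
    print(row)
```

With the threads scheduled in descending order, every thread except thread 0 exhausts its spin budget before its neighbour has written anything, which gives the same pattern as the failing output in the question: row 0 filled and only the last column set in the other rows. With ascending order (`range(10)`) the same function fills every cell above the diagonal.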
- Is there any clever way to achieve this?
No. There is no way to impose execution order in Numba CUDA in the manner you are asking for.
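One alternative that is sometimes used, and that is not part of this answer, is to move the ordering to the host: kernel launches issued to the same CUDA stream execute one after another, so work can be split into "waves" that are launched in sequence. The sketch below covers only the host-side grouping, in plain Python. The name `wavefronts` is hypothetical, and the dependency rule (each cell waits for the previous cell in its row and for [thread_id-1, j+1]) is an assumption read off the kernel in the question.

```python
def wavefronts(n):
    """Group cells (t, j) with j > t into waves.

    Every prerequisite of a cell lies in an earlier wave, so launching
    one kernel per wave would respect the ordering without any
    in-kernel spinning.
    """
    level = {}
    for t in range(n):                  # row t depends only on row t - 1
        for j in range(t + 1, n):
            deps = []
            if j > t + 1:
                deps.append(level[(t, j - 1)])      # previous cell in the same row
            if t > 0 and j < n - 1:
                deps.append(level[(t - 1, j + 1)])  # neighbour in the previous row
            level[(t, j)] = 1 + max(deps, default=0)

    waves = {}
    for cell, lv in level.items():
        waves.setdefault(lv, []).append(cell)
    return [waves[lv] for lv in sorted(waves)]


for step, cells in enumerate(wavefronts(4), start=1):
    print(step, cells)
```

Each inner list could then be handed to a separate kernel launch. Whether the launch overhead is acceptable depends on how much work each cell represents.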
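Answer 1 (score: -2)
After some trial and error, it seems that the cuda.syncthreads() approach is not feasible.
However, this can be done by setting up a global array, with the requirement that an atomic operation is performed before any value is read: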
import math
import numpy as np
from numba import cuda
@cuda.jit
def keep_max(global_array,array):
    thread_id = cuda.grid(1)
    if thread_id<N:
        # loop through other elements in global_array
        for j in range(thread_id+1, N):
            # consistently read values from array
            for _ in range(1000): # or while True:
                # for thread_id == 0, just execute
                if thread_id==0:
                    cuda.atomic.add(array,(thread_id,j), 1)
                    break
                # for thread_id>0
                else:
                    # if j reaches the last number of global_array
                    # just execute
                    if j == N-1:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break
                    else:
                        # check if the previous thread_id, i.e., thread_id - 1
                        # finishes the execution of combination [thread_id-1,j+1]
                        cuda.atomic.max(array, (thread_id-1,j+1), 0)
                        if array[thread_id-1,j+1]>0:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
N = 10
global_array = np.arange(N)
array = np.zeros([N,N])
# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock
print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)
Output:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]