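I am writing a CUDA program that defines an array in shared memory. I need only one thread to write to each index of this array, i.e. the first thread to reach the write instruction should change the value, and every other thread, whether in the same warp or in a later warp, should read the value that was written.
Here is the code snippet: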
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d \n", seq_1_index, blockIdx.x);
}
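What happens now is that all threads in the warp execute exactly the same instruction sequence, so the code inside the if condition runs 32 times. I need it to run only once.
How can I achieve this?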
Answer 0 (score: 3)
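You can use atomicCAS() for this. It performs an atomic compare-and-swap operation.
This function tests a variable and, if it matches a given condition (say, false), replaces it with another value (say, true). It does all of this atomically, i.e. without the possibility of being interrupted by another thread.
The return value of the atomic function gives us useful information here: it is the value the variable held before the operation. If the return value in the example above is false, we can be certain that this thread replaced it with true. We can also be certain that we were the "first" thread to encounter this condition, and that every other thread performing the same operation will get true back rather than false.
Here is a working example: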
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
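Applied to per-index flags as in the question, the return value can gate the one-time work. The following is only a minimal sketch: it assumes the flags are stored as int rather than char (the native atomicCAS has no 8-bit overload), and the names claim_once, NUM_FLAGS, and count_winners are hypothetical.

```cuda
#include <cuda_runtime.h>

#define NUM_FLAGS 8   // hypothetical number of flag slots

// Each thread tries to claim slot (threadIdx.x % NUM_FLAGS). Only the thread
// whose atomicCAS returns 0 (the old value) enters the if body for that slot.
__global__ void claim_once(int *winners){
  __shared__ int flag[NUM_FLAGS];
  if (threadIdx.x < NUM_FLAGS) flag[threadIdx.x] = 0;
  __syncthreads();
  int idx = threadIdx.x % NUM_FLAGS;
  if (atomicCAS(&flag[idx], 0, 1) == 0) {
    // one-time work for this index would go here
    atomicAdd(&winners[idx], 1);
  }
}

// Host helper: launches one block and returns how many threads won slot idx.
int count_winners(int idx, int threads){
  int *d_winners;
  int h_winners[NUM_FLAGS];
  cudaMalloc(&d_winners, sizeof(h_winners));
  cudaMemset(d_winners, 0, sizeof(h_winners));
  claim_once<<<1, threads>>>(d_winners);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(h_winners, d_winners, sizeof(h_winners), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners[idx];
}
```

With 64 threads, eight threads compete for each slot, and count_winners(3, 64) should report a single winner for slot 3.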
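In response to a question in the comments, we can extend this to a char-sized flag by creating an arbitrary atomic operation modeled on the double atomicAdd() function given in the programming guide. The basic idea is to perform the atomicCAS on a supported data size (e.g. unsigned) and to convert the required operation so that it effectively supports the char size. This is done by converting the char address to a suitably aligned unsigned address, and then shifting the char quantity so that it lines up with the appropriate byte position in the unsigned value.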
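To make the address arithmetic concrete, this host-side sketch computes the aligned word address, the bit shift, and the merged word for a given byte address. The helper names (aligned_word_addr, byte_shift, merge_byte) are hypothetical and not part of CUDA, and the sketch assumes little-endian byte order, as on NVIDIA GPUs.

```cuda
#include <cstdint>

// Round a byte address down to the 4-byte word that contains it.
inline std::uintptr_t aligned_word_addr(std::uintptr_t addr){
  return addr & ~static_cast<std::uintptr_t>(3);
}

// Bit position of the byte within that word (0, 8, 16, or 24).
inline unsigned byte_shift(std::uintptr_t addr){
  return static_cast<unsigned>(addr & 3) * 8;
}

// Replace one byte of word with val, leaving the other three bytes untouched.
inline unsigned merge_byte(unsigned word, unsigned shift, unsigned char val){
  unsigned mask = ~(0xFFu << shift);
  return (word & mask) | (static_cast<unsigned>(val) << shift);
}
```

For instance, byte address 0x1006 lies in the word at 0x1004 with a shift of 16, so merge_byte(0xAABBCCDD, 16, 0x01) yields 0xAA01CCDD.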
Here is a working example:
$ cat t327.cu
#include <stdio.h>
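__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
  // Round the char address down to the enclosing 4-byte-aligned word.
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  // Bit offset of the target byte within that word (0, 8, 16, or 24).
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned mask = 0xFFU;
  mask <<= al_offset;
  mask = ~mask;                        // clears only the target byte
  unsigned sval = (unsigned char)val;  // avoid sign extension of a negative char
  sval <<= al_offset;
  unsigned old = *al_addr, assumed, setval;
  do {
    assumed = old;
    // Swap only if the target byte currently equals cmp; otherwise just report it.
    if ((char)((assumed >> al_offset) & 0xFFU) != cmp) break;
    setval = assumed & mask;
    setval |= sval;
    old = atomicCAS(al_addr, assumed, setval);
  } while (assumed != old);
  return (char) ((assumed >> al_offset) & 0xFFU);
}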
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
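The above provides a generalized atomicCAS for char size. This allows you to swap any char value for any other char value. In your specific case, if you only need what is effectively a boolean flag, you can make this operation more efficient by using atomicOr, as already mentioned in the comments. Using atomicOr eliminates the loop in the custom atomic function above. Here is a working example:
$ cat t327.cu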
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
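These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.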
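One way to satisfy that assumption when the logical flag count is not a multiple of 4 is to pad the storage and align it on a word boundary. This is a sketch under stated assumptions, not part of the original answer: NUM_FLAGS, PADDED_FLAGS, set_char_flag, padded_flags, and run_padded_flags are hypothetical names, and the block must have at least PADDED_FLAGS threads so that every byte gets initialized.

```cuda
#include <cuda_runtime.h>

#define NUM_FLAGS 3                               // hypothetical logical flag count
#define PADDED_FLAGS (((NUM_FLAGS) + 3) / 4 * 4)  // storage rounded up to a multiple of 4

// Same idea as my_char_atomic_flag above: set the byte's low bit, return its old value.
__device__ char set_char_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *>(((unsigned long long)addr) & 0xFFFFFFFFFFFFFFFCULL);
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  return (char)((atomicOr(al_addr, 1U << al_offset) >> al_offset) & 0xFFU);
}

__global__ void padded_flags(int *winners){
  // __align__(4) puts the array on a word boundary; the padding keeps the
  // last accessed word entirely inside the array.
  __shared__ __align__(4) char flag[PADDED_FLAGS];
  if (threadIdx.x < PADDED_FLAGS) flag[threadIdx.x] = 0;
  __syncthreads();
  if (set_char_flag(flag + (threadIdx.x % NUM_FLAGS)) == 0)
    atomicAdd(winners, 1);   // count the threads that claimed a flag first
}

// Host helper: returns the number of threads that were first on some flag.
int run_padded_flags(int threads){
  int *d_winners;
  int h_winners = 0;
  cudaMalloc(&d_winners, sizeof(int));
  cudaMemset(d_winners, 0, sizeof(int));
  padded_flags<<<1, threads>>>(d_winners);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners;
}
```

With 32 threads and three logical flags, exactly one thread per flag should see the old value 0, so run_padded_flags(32) is expected to return NUM_FLAGS.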