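CUDA atomic operation on bytes so that only one thread acts

Date: 2018-11-18 23:19:32

Tags: cuda shared-memory atomic

I am writing a CUDA program that defines an array in shared memory. What I need is to allow only one thread to write to each index of that array, i.e., the first thread to reach the write instruction should change the value, and any other thread, in the same warp or a later warp, should read the value that was written.

Here is the code snippet: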

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
    printf("copy seq_shared seq_1_index = %d,  block = %d \n", seq_1_index, blockIdx.x);
}

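What happens now is that all threads in the warp execute this exact instruction sequence, so the rest of the code inside the if condition executes 32 times. I need it to execute only once.

How can I achieve this?

1 Answer:

Answer 0: (score: 3)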

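You can use atomicCAS(). It performs an atomic compare-and-swap operation.

This function tests a variable and, if it matches a given condition (say, false), replaces it with another value (say, true). It does all of this atomically, i.e., without the possibility of interruption by another thread.

The return value of the atomic function gives us useful information here. If the return value in the example above is false, we can be certain that the variable has been replaced with true. We can also be certain that we are the "first" thread to encounter this condition, and that every other thread performing the same operation will get a return value of true rather than false.

Here is a worked example: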

$ cat t327.cu
#include <stdio.h>

__global__ void k(){

  __shared__ int flag;
  if (threadIdx.x == 0) flag = 0;
  __syncthreads();

  int retval = atomicCAS(&flag, 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
  // could do if statement on retval here
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
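The "if statement on retval" mentioned in the kernel comment above could look like the following sketch. It assumes the flags are stored as `int` (a size that atomicCAS supports directly); `NUM_FLAGS`, `claim_once`, `winners`, and `count_winners` are hypothetical names introduced only for illustration, not part of the original question.

```cuda
#include <cstdio>

#define NUM_FLAGS 8  // hypothetical number of array indices to claim

// Each thread tries to claim one index; only the thread that observes the
// old value 0 runs the guarded body.
__global__ void claim_once(int *winners){
  __shared__ int flag[NUM_FLAGS];
  if (threadIdx.x < NUM_FLAGS) flag[threadIdx.x] = 0;
  __syncthreads();

  int idx = threadIdx.x % NUM_FLAGS;      // several threads share each index
  if (atomicCAS(&flag[idx], 0, 1) == 0) { // old value was 0: this thread won
    atomicAdd(winners, 1);                // runs once per index
  }
}

// Launches one block of 32 threads and returns how many threads entered
// the guarded body.
int count_winners(){
  int *d_winners, h_winners = 0;
  cudaMalloc(&d_winners, sizeof(int));
  cudaMemset(d_winners, 0, sizeof(int));
  claim_once<<<1,32>>>(d_winners);
  cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners;
}

int main(){
  printf("winners = %d\n", count_winners());
}
```

With 8 flags and 32 threads, exactly 8 threads should enter the guarded body, one per index. Note that the atomic only decides which thread wins; if the winning thread also writes other data that the remaining threads read afterwards, a barrier such as __syncthreads() is still needed before those reads.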

In response to a question in the comments, we can extend this to a char-sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function in the programming guide. The basic idea is that we perform the atomicCAS using a supported data size (e.g. unsigned) and convert the needed operation so that it effectively supports a char size. This is done by converting the char address to a suitably aligned unsigned address, and then shifting the char quantity so that it lines up with the appropriate byte position in the unsigned value.

Here is a worked example:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
  // round the address down to the enclosing 4-byte-aligned word
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  // bit offset of the target byte within that word
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned mask = 0xFFU;
  mask <<= al_offset;
  mask = ~mask;                        // clears only the target byte
  unsigned sval = (unsigned char)val;  // avoid sign extension into other bytes
  sval <<= al_offset;
  unsigned old = *al_addr, assumed, setval;
  do {
        assumed = old;
        // if the target byte does not match cmp, leave memory unchanged
        if (((assumed >> al_offset) & 0xFFU) != (unsigned char)cmp) break;
        setval = assumed & mask;
        setval |= sval;
        old = atomicCAS(al_addr, assumed, setval);
    } while (assumed != old);
  // return the previous value of the target byte
  return (char) ((assumed >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

The above gives a generalized atomicCAS for the char size. It allows you to swap any char value for any other char value. In your particular case, if you only need what is effectively a boolean flag, you can make this operation more efficient by using atomicOr, as already mentioned in the comments. Using atomicOr eliminates the loop in the custom atomic function above. Here is a worked example:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
  // round the address down to the enclosing 4-byte-aligned word
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  // bit offset of the target byte within that word
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  // set the low bit of the target byte and return the byte's previous value
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with, for example, a char array of size 3 (and only 3 threads), because each operation reads and writes the whole 4-byte word that contains the target byte.
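One way to satisfy the size requirement when only a few flags are needed is to pad the array, as in the sketch below. It reuses the my_char_atomic_flag function from the answer; `NFLAGS`, `NFLAGS_PADDED`, `winners`, and `count_winners_padded` are hypothetical names for illustration, and the explicit `__align__(4)` is an added assumption to make sure the array starts on a word boundary.

```cuda
#include <cstdio>

#define NFLAGS 3                               // hypothetical: flags actually needed
#define NFLAGS_PADDED (((NFLAGS + 3) / 4) * 4) // rounded up to a multiple of 4 (here 4)

// Same helper as in the answer: sets the low bit of one byte through a
// 32-bit atomicOr and returns the byte's previous value.
__device__ char my_char_atomic_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

__global__ void k(int *winners){
  // Padded and 4-byte aligned, so every word the helper touches lies
  // entirely inside the array.
  __shared__ __align__(4) char flag[NFLAGS_PADDED];
  if (threadIdx.x < NFLAGS_PADDED) flag[threadIdx.x] = 0;
  __syncthreads();

  // Only the first NFLAGS bytes are used as flags; the rest is padding.
  if (my_char_atomic_flag(flag + (threadIdx.x % NFLAGS)) == 0)
    atomicAdd(winners, 1);
}

// Launches one block of 32 threads and returns how many threads saw 0.
int count_winners_padded(){
  int *d_winners, h_winners = 0;
  cudaMalloc(&d_winners, sizeof(int));
  cudaMemset(d_winners, 0, sizeof(int));
  k<<<1,32>>>(d_winners);
  cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners;
}

int main(){
  printf("winners = %d\n", count_winners_padded());
}
```

Here 3 flags are requested but 4 bytes are allocated, so exactly 3 threads should see a previous value of 0, one per flag.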