Question

I want to implement Gaussian elimination in CUDA, but I have a problem with thread synchronization inside an if/else.

Here is my simple code:

__device__ bool zr(float val) {
    const float zeroEpsilon = 1e-12f;
    return fabs(val) < zeroEpsilon;
}

__global__ void gauss(float* data, unsigned int size, bool* success) {
    //unsigned int len = size * (size + 1);
    extern  __shared__ float matrix[];
    __shared__ bool succ;
    __shared__ float div;
    unsigned int ridx = threadIdx.y;
    unsigned int cidx = threadIdx.x;
    unsigned int idx = (size + 1) * ridx  + cidx;
    matrix[idx] = data[idx];
    if (idx == 0)
        succ = true;
    __syncthreads();
    for (unsigned int row = 0; row < size; ++row) {
        if (ridx == row) {
            if (cidx == row) {
                div = matrix[idx];
                if (zr(div)) {
                    succ = false;
                    div = 1.0;
                }
            }
            __syncthreads();
            matrix[idx] = matrix[idx] / div;
            __syncthreads();
        }
        else {
            __syncthreads();
            __syncthreads();
        }
        if (!succ)
            break;
    }
    __syncthreads();
    if (idx == 0)
        *success = succ;
    data[idx] = matrix[idx];
    __syncthreads();
}

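It works as follows:

Copy the matrix into shared memory.
Iterate over the rows.
Divide each row by its diagonal element.

The problem is in the if/else inside the for loop - a deadlock: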

==Ocelot== PTX Emulator failed to run kernel "_Z5gaussPfjPb" with exception: 
==Ocelot== [PC 30] [thread 0] [cta 0] bar.sync 0 - barrier deadlock:
==Ocelot== context at: [PC: 59] gauss.cu:57:1 11111111111111111111
==Ocelot== context at: [PC: 50] gauss.cu:54:1 11111111111111111111
==Ocelot== context at: [PC: 33] gauss.cu:40:1 00000000000000011111
==Ocelot== context at: [PC: 30] gauss.cu:51:1 11111111111111100000

I don't know why this happens. When I remove the synchronization from the if/else block, it works fine. Can someone explain?

Answer 1

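__syncthreads() waits until all threads of a thread block have reached that point and finished their work up to it. Because of your if/else condition, some threads wait at the barrier in the else branch while others wait at the barrier in the if branch, and each group waits for the other. But the threads in the if branch never reach the barrier in the else branch, so neither barrier is ever completed by the whole block.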
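One way to avoid this is to hoist both barriers out of the branches so that every thread executes the same __syncthreads() calls. The following is only a sketch under assumptions: the name gaussFixed and the host helper runGaussFixed are hypothetical, the launch shape (a single block of (size + 1) x size threads) is inferred from the indexing in the question, and the local copy ok of succ is an extra precaution I added so that no thread reads succ while the next iteration may already be writing it.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ bool zr(float val) {
    const float zeroEpsilon = 1e-12f;
    return fabsf(val) < zeroEpsilon;
}

// Same row normalization as in the question, but every barrier sits
// outside the ridx/cidx conditionals, so all threads reach it.
__global__ void gaussFixed(float* data, unsigned int size, bool* success) {
    extern __shared__ float matrix[];
    __shared__ bool succ;
    __shared__ float div;
    unsigned int ridx = threadIdx.y;
    unsigned int cidx = threadIdx.x;
    unsigned int idx = (size + 1) * ridx + cidx;
    matrix[idx] = data[idx];
    if (idx == 0)
        succ = true;
    __syncthreads();
    for (unsigned int row = 0; row < size; ++row) {
        // Only the diagonal thread picks the divisor.
        if (ridx == row && cidx == row) {
            div = matrix[idx];
            if (zr(div)) {
                succ = false;
                div = 1.0f;
            }
        }
        __syncthreads();      // all threads: div and succ are now visible
        bool ok = succ;       // local copy, identical in every thread
        if (ridx == row)
            matrix[idx] = matrix[idx] / div;
        __syncthreads();      // all threads: the row is normalized
        if (!ok)
            break;            // uniform: every thread leaves the loop together
    }
    if (idx == 0)
        *success = succ;
    data[idx] = matrix[idx];
}

// Hypothetical host helper (assumption): one block, size * (size + 1) <= 1024.
bool runGaussFixed(float* hostData, unsigned int size) {
    size_t bytes = size * (size + 1) * sizeof(float);
    float* dData = nullptr;
    bool* dOk = nullptr;
    cudaMalloc(&dData, bytes);
    cudaMalloc(&dOk, sizeof(bool));
    cudaMemcpy(dData, hostData, bytes, cudaMemcpyHostToDevice);
    dim3 block(size + 1, size);
    gaussFixed<<<1, block, bytes>>>(dData, size, dOk);
    bool ok = false;  // stays false if the launch or a copy fails
    cudaMemcpy(&ok, dOk, sizeof(bool), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostData, dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOk);
    return ok;
}
```

The conditionals now guard only the work (choosing the divisor, dividing the row), never the barrier. The break is safe because ok has the same value in every thread, so either all threads leave the loop or none do.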

Answer 2

__syncthreads() is doing exactly what it is meant to do.

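When a thread reaches a __syncthreads instruction it blocks/stalls. When that happens its warp (32 threads) blocks as well, and it stays blocked until all threads in the same block of threads have reached that statement.

However, if one warp or one thread in the same block of threads does not reach the same __syncthreads statement, the block deadlocks: at least one thread is waiting for all the other threads to reach the same statement, and if that never happens, you get a deadlock.

What you are doing now is excluding at least one thread from participating in the __syncthreads event, by placing __syncthreads inside an if statement that not all threads will reach. Hence the deadlock.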
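If the decision itself depends on per-thread data, one option is to let every thread take part in the barrier and get a block-wide verdict back from it. CUDA provides __syncthreads_or(int predicate), which behaves like __syncthreads() and additionally returns non-zero in every thread if the predicate is non-zero for any thread of the block. A minimal sketch, assuming a single block with one thread per element; the names reciprocalUnlessAnyZero and reciprocalAll are hypothetical and not from the question:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Replaces each element by its reciprocal, unless any element in the
// block is (near) zero, in which case the data is left untouched.
__global__ void reciprocalUnlessAnyZero(float* data, float eps, int* anyZero) {
    unsigned int i = threadIdx.x;
    float v = data[i];
    // Every thread calls the barrier; the result is the same in all threads.
    int blockHasZero = __syncthreads_or(fabsf(v) < eps);
    if (blockHasZero) {
        if (i == 0)
            *anyZero = 1;
        return;  // uniform exit: all threads take this branch together
    }
    data[i] = 1.0f / v;
    if (i == 0)
        *anyZero = 0;
}

// Hypothetical host helper (assumption): n <= 1024, launched as one block.
bool reciprocalAll(float* host, unsigned int n, float eps) {
    float* d = nullptr;
    int* dFlag = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMalloc(&dFlag, sizeof(int));
    cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);
    reciprocalUnlessAnyZero<<<1, n>>>(d, eps, dFlag);
    int flag = 1;  // treated as failure if the launch or a copy fails
    cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(host, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    cudaFree(dFlag);
    return flag == 0;
}
```

Because the branch condition comes out of the barrier, no thread can be left out of it, which is the property the if/else in the question breaks.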

Thread synchronization inside if/else blocks in CUDA
