Question

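I have a kernel with a "while" loop that iteratively updates the elements of an array using information about their neighbors (only one neighbor in the sample code below). The loop stops when no element was changed in the current iteration.

Unfortunately, in some cases part of the threads exit this loop prematurely (as if they ignored the synchronization barrier). Some inputs are processed correctly every time, and other inputs (many of them) are processed incorrectly every time (i.e. there is no random factor). Strangely, this error occurs only in the Release build, while the Debug build always works fine. More precisely, the CUDA compiler option "-G (Generate GPU Debug Information)" determines whether the processing is correct. Arrays of size 32x32 or smaller are always processed correctly.

Here is the sample code: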

__global__ void kernel(int *source, int size, unsigned char *result, unsigned char *alpha)
{
    int x = threadIdx.x, y0 = threadIdx.y * 4;
    int i, y;
    __shared__ bool alpha_changed;

    // Zero intermediate array using margins for safe access to neighbors
    const int stride = MAX_SIZE + 2;
    for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
    {
        alpha[i] = 0;
    }
    __syncthreads();

    for (int bit = MAX_BITS - 1; bit >= 0; bit--)
    {
        __syncthreads();

        // Fill intermediate array with bit values from input array
        alpha_changed = true;
        alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
        __syncthreads();

        // The loop in question
        while (alpha_changed)
        {
            alpha_changed = false;
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 1) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 2) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 3) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 4) * stride] = 1;
            }
            __syncthreads();
        }
        __syncthreads();

        // Save result
        result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
        result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
        result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
        result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
        __syncthreads();
    }
}

// Run only 1 thread block, where size equals 64.
kernel <<< 1, dim3(size, size / 4) >>> (source_gpu, size, result_gpu, alpha_gpu);

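The expected result of this sample kernel is an array in which each row contains at most one contiguous run of "1" values. Instead, I get some rows in which "0" and "1" alternate in some way.

This bug reproduces on my mobile GPU, a GeForce 740M (Kepler), on Windows 7 x64 SP1, with CUDA 6.0 or 6.5, using Visual C++ 2012 or 2013. I can also provide a sample Visual Studio project with a sample input array (one that is processed incorrectly).

I have already tried different arrangements of __syncthreads(), fences, and the "volatile" qualifier, but the bug remains.

Any help is appreciated.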

Answer 1

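I think the problem is your access to alpha_changed. Keep in mind that this is a single value shared by all threads in the block. There is a race condition between one warp resetting this variable and another warp checking the loop condition: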

    // The loop in question
    while (alpha_changed)
    {
        alpha_changed = false;
        // ...
        // alpha_changed may be set to true here
        // ...

        __syncthreads();

        // race condition window here. Another warp may already execute
        // the alpha_changed = false; line before this warp continues.
    }

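The key is to execute __syncthreads() before setting the shared variable to false.

You could use a local variable inside the loop to record whether this thread made any changes. That avoids having __syncthreads() all over the place. Then do a reduction at the end of the loop: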

    // The loop in question
    while (alpha_changed)
    {
        bool alpha_changed_here = false;
        // ...
        // alpha_changed_here may be set to true here
        // ...

        __syncthreads();
        alpha_changed = false;
        __syncthreads();
        // I think you can get away with a simple if-statement here
        // instead of a proper reduction
        if (alpha_changed_here) alpha_changed = true;
        __syncthreads();
    }

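As far as I know, this approach of using just one variable in shared memory currently works. If you want to be sure, use a proper reduction algorithm. You can use __any() to reduce 32 values within one warp in a single instruction. Which algorithm to use depends on the size of the block (I don't know the exact behavior for sizes that are not a multiple of 32).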
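For illustration, here is a minimal sketch of a loop whose exit decision is a real block-wide reduction. It is not the asker's kernel: it handles a single row in a single block, and the names ROW_LEN, propagate_right and run_propagate are made up for this example. Instead of __any(), which only covers one warp, it uses __syncthreads_or(), which acts as a barrier and returns the OR of the predicate over all threads of the block (available on compute capability 2.0 and later, so it should work on Kepler). Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define ROW_LEN 64  // hypothetical row length, launched as one block of ROW_LEN threads

// Spreads every 1 in `row` to the right until no cell changes any more.
__global__ void propagate_right(unsigned char *row)
{
    __shared__ unsigned char cells[ROW_LEN + 1];  // cells[0] is a zero margin
    int x = threadIdx.x;

    if (x == 0) cells[0] = 0;
    cells[x + 1] = row[x];
    __syncthreads();

    int changed_anywhere;
    do
    {
        // Read phase: each thread decides locally whether its cell must change.
        bool changed_here = (cells[x] != 0 && cells[x + 1] == 0);
        __syncthreads();  // all reads finish before any thread writes

        // Write phase: each thread writes only its own cell.
        if (changed_here) cells[x + 1] = 1;

        // Barrier plus block-wide OR: every thread receives the same value,
        // so all threads leave the loop in the same iteration.
        changed_anywhere = __syncthreads_or(changed_here);
    } while (changed_anywhere);

    row[x] = cells[x + 1];
}

// Host helper (hypothetical): copies one row to the GPU, runs the kernel, copies it back.
std::vector<unsigned char> run_propagate(std::vector<unsigned char> host_row)
{
    assert(host_row.size() == ROW_LEN);
    unsigned char *dev_row = nullptr;
    cudaMalloc(&dev_row, ROW_LEN);
    cudaMemcpy(dev_row, host_row.data(), ROW_LEN, cudaMemcpyHostToDevice);
    propagate_right<<<1, ROW_LEN>>>(dev_row);
    cudaMemcpy(host_row.data(), dev_row, ROW_LEN, cudaMemcpyDeviceToHost);
    cudaFree(dev_row);
    return host_row;
}
```

Because the loop condition comes from the return value of the barrier itself rather than from a shared flag that is reset later, there is no window in which one warp can reset the flag before another warp has tested it.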

Incorrect synchronization inside a "while" loop (happens only in Release mode)
