Question

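I am trying to write a parallel prefix scan in CUDA by following this tutorial.

I am attempting the work-inefficient "double buffered" version, as described in the tutorial.

This is what I have: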

// double buffered naive.

// d = number of iterations, N - size, and input.
__global__ void prefixsum(int* in, int d, int N)
{

        //get the block index
        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        // allocate shared memory
        extern __shared__ int temp_in[], temp_out[];

        // copy data to it.
        temp_in[idx] = in[idx];
        temp_out[idx] = 0;

        // block until all threads copy

        __syncthreads();

        int i = 1;
        for (i; i<=d; i++)
        {
                if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
                {
                        // copy new result to temp_out
                        temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
                }
                else
                {
                        // if the element is to remain unchanged, copy the same thing
                        temp_out[idx] = temp_in[idx];
                }
                // block until all theads do this
                __syncthreads();
                // copy the result to temp_in for next iteration
                temp_in[idx] = temp_out[idx];
                // wait for all threads to do so
                __syncthreads();
        }

        //finally copy everything back to global memory
        in[idx] = temp_in[idx];
}

Can you point out what is wrong with this? I have written comments describing what I think should happen.

This is the kernel invocation -

prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);

This is the grid and block allocation:

dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);

The problem is that I do not get correct output for any input longer than 8 elements.

Answer 1

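I see two problems in your code.

Problem 1: extern shared memory

Ugh... I hate extern __shared__ memory. The problem is that the compiler does not know how large the arrays are. As a result, both of them point to the same piece of memory! So, in your case, temp_in[5] and temp_out[5] refer to the same word in shared memory.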
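The aliasing described above can be observed directly. The following is a minimal sketch, not part of the original answer: the kernel name alias_probe and the host helper probe_alias are hypothetical, and it assumes a single block with one int of dynamic shared memory. It writes through temp_in and reads the same index through temp_out.

```cuda
#include <cuda_runtime.h>

// Hypothetical probe kernel: declares the two arrays exactly as in the question.
__global__ void alias_probe(int* result)
{
    extern __shared__ int temp_in[], temp_out[];

    if (threadIdx.x == 0) {
        temp_in[0] = 42;            // write through the first name
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        result[0] = temp_out[0];    // read through the second name
    }
}

// Hypothetical host helper: launches one block and returns what the kernel read.
int probe_alias()
{
    int* d_result = nullptr;
    cudaMalloc(&d_result, sizeof(int));

    // Third launch parameter: bytes of dynamic shared memory per block.
    alias_probe<<<1, 32, sizeof(int)>>>(d_result);

    int h_result = -1;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

If the two names really do start at the same shared-memory address, probe_alias() should return 42 even though temp_out was never written.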

If you really want extern __shared__ memory, you can manually offset the second array, for example:

size_t size = .... //the size of your array
extern __shared__ int memory[];
int* temp_in=memory;
int* temp_out=memory+size;
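To make the offset approach concrete, here is a small sketch under stated assumptions: one block, at most 1024 elements, and hypothetical names (round_trip_kernel, round_trip). It also shows a detail the snippet above leaves implicit: the total byte count for both halves has to be passed as the third launch parameter, which the launch in the question does not do.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: fills one half of shared memory with the input and
// zeroes the other half, then copies the first half back out.
__global__ void round_trip_kernel(const int* in, int* out)
{
    extern __shared__ int memory[];
    int* temp_in  = memory;               // first blockDim.x ints
    int* temp_out = memory + blockDim.x;  // next blockDim.x ints

    int tid = threadIdx.x;
    temp_in[tid]  = in[tid];
    temp_out[tid] = 0;                    // must not clobber temp_in
    __syncthreads();

    out[tid] = temp_in[tid];
}

// Hypothetical host helper; assumes 1 <= h_in.size() <= 1024 (one block).
std::vector<int> round_trip(const std::vector<int>& h_in)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Two arrays of n ints each, so request 2 * bytes of shared memory.
    round_trip_kernel<<<1, n, 2 * bytes>>>(d_in, d_out);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With the two halves kept apart, zeroing temp_out leaves temp_in intact, so the helper should hand back its input unchanged.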

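Problem 2: Shared array index

Shared memory is private to each block. That is, temp[0] in one block can be different from temp[0] in another block. However, you index it by blockIdx.x*blockDim.x + threadIdx.x, as if the temp arrays were shared between blocks.

Instead, you should most likely index your temp arrays by threadIdx.x alone.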

Of course, the in array is global, and indexing it with idx is correct.
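Putting both fixes together, a single-block version of the naive double-buffered scan might look like the sketch below. It is an illustration rather than the answer author's code: the names prefixsum_block and inclusive_scan_gpu are hypothetical, it assumes the whole input fits in one block (at most 1024 elements), and it makes two further changes of its own, replacing the pow() calls with an integer offset that doubles each pass and swapping the buffer pointers instead of copying temp_out back into temp_in.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive inclusive scan of n <= blockDim.x elements within ONE block.
__global__ void prefixsum_block(int* data, int n)
{
    extern __shared__ int memory[];
    int* temp_in  = memory;               // fix 1: two distinct halves
    int* temp_out = memory + blockDim.x;

    int tid = threadIdx.x;                // fix 2: index shared memory by threadIdx.x
    temp_in[tid] = (tid < n) ? data[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        if (tid >= offset) {
            temp_out[tid] = temp_in[tid - offset] + temp_in[tid];
        } else {
            temp_out[tid] = temp_in[tid]; // element unchanged in this pass
        }
        __syncthreads();                  // all reads and writes of this pass are done

        // Swap the roles of the two buffers for the next pass.
        int* tmp = temp_in;
        temp_in  = temp_out;
        temp_out = tmp;
    }

    if (tid < n) {
        data[tid] = temp_in[tid];
    }
}

// Hypothetical host helper; assumes h_in.size() <= 1024.
std::vector<int> inclusive_scan_gpu(const std::vector<int>& h_in)
{
    std::vector<int> h_out(h_in);
    int n = static_cast<int>(h_in.size());
    if (n == 0) return h_out;

    size_t bytes = n * sizeof(int);
    int* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_in.data(), bytes, cudaMemcpyHostToDevice);

    prefixsum_block<<<1, n, 2 * bytes>>>(d_data, n);

    cudaMemcpy(h_out.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return h_out;
}
```

Scanning inputs that span several blocks needs an extra step to combine the per-block totals, which this sketch does not attempt.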
