Does this CUDA scan kernel work only within a single block, or across multiple blocks?

Time: 2014-10-08 17:05:08

Tags: cuda

I am doing a homework assignment and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only scan the data if a single block is used (because of int id = threadIdx.x). Is this true?

//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every element, so the size of the array needs to be 2*numElements
    //CUDA does not like dynamic arrays in shared memory, so it might be necessary to explicitly state
    //the size of this memory allocation
    __shared__ unsigned int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();


    //for each thread, loop through each of the steps
    //each step, move the next resultant addition to the thread's
    //corresponding space to be manipulated in the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;

        //IF: the number needs to be added to something, make sure to add those contents with the contents of 
        //the element offset number of elements away, then move it to its corresponding space
        //ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}

I would like to alter this kernel to work across multiple blocks. I was hoping it would be as simple as changing int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But shared memory is only visible within a block, which means the scan kernel cannot share the appropriate information across blocks.

1 Answer:

Answer 0 (score: 4)


From what I can tell, this kernel will only scan the data if a single block is used (because of int id = threadIdx.x). Is this true?

Not exactly. This kernel will run no matter how many blocks are launched, but all blocks will fetch the same input and compute the same output, because of how id is calculated:

int id = threadIdx.x;

id does not depend on blockIdx, and will therefore be identical across blocks, no matter what their number is.


If I were to make a multi-block version of this scan without changing the code too much, I would introduce an auxiliary array to store the per-block sums. Then, run a similar scan on that array, calculating the per-block increments. Finally, run a last kernel to add each block's increment to the block's elements. If memory serves, there is a similar kernel in the CUDA SDK samples.
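The three-phase structure described above could be sketched as follows. This is only an illustrative outline, not a drop-in solution: the kernel names scanBlockKernel and addIncrementsKernel are made up for this sketch, and the per-block scan phase is assumed to look like the kernel in the question.

```cuda
// Phase 3: add each block's increment to every element of that block.
// (Hypothetical kernel name; error checking and edge cases omitted.)
__global__ void addIncrementsKernel(unsigned int *data,
                                    const unsigned int *blockIncrements,
                                    size_t numElements)
{
    size_t id = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (id < numElements)
        data[id] += blockIncrements[blockIdx.x];
}

// Host-side orchestration of the three phases (sketch):
// 1. scanBlockKernel<<<numBlocks, blockSize>>>(in, out, blockSums);
//    - each block scans its own chunk in shared memory (like the
//      kernel in the question) and writes its total to
//      blockSums[blockIdx.x]
// 2. scanBlockKernel<<<1, numBlocks>>>(blockSums, blockIncrements, numBlocks);
//    - a single block scans the per-block sums; this works as long as
//      numBlocks fits in one block, otherwise the step must recurse
// 3. addIncrementsKernel<<<numBlocks, blockSize>>>(out, blockIncrements, n);
```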

Since Kepler, the code above could be rewritten to be much more efficient, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and could improve performance. A combination of these improvements would allow you to get rid of shared memory and work only with registers for maximum performance.
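As a sketch of the warp-level idea, assuming a warp size of 32: each lane accumulates values shuffled up from lower lanes, with no shared memory or __syncthreads. Note that since CUDA 9 the synchronizing variant __shfl_up_sync must be used; the older __shfl_up available at the time of this answer is deprecated.

```cuda
// Inclusive scan within one warp using shuffle intrinsics (sketch).
// Requires CUDA 9+ for __shfl_up_sync; assumes a warp size of 32.
__device__ unsigned int warpInclusiveScan(unsigned int val)
{
    unsigned int lane = threadIdx.x & 31;  // lane index within the warp
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        // fetch the partial sum from the lane 'offset' positions below
        unsigned int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset)
            val += n;  // lanes below 'offset' keep their current value
    }
    return val;  // lane i now holds the sum of lanes 0..i
}
```

Combining per-warp results into a block-wide or grid-wide scan would still require the auxiliary-sum approach described above, applied at the warp level.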