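While generalizing a kernel that shifts the values of a 2D array one position to the right (wrapping around the row boundaries), I ran into a warp synchronization problem. The full code is included below.
The code is meant to work for an arbitrary array width, array height, number of thread blocks, and number of threads per block. When I choose a block size of 33 threads (i.e., one more thread than a full warp), the 33rd thread does not synchronize when __syncthreads() is called. This corrupts the output data. The problem only appears when there is more than one warp and the width of the array is greater than the number of threads (e.g., width = 35 with 34 threads).
The following is a downsized example of what happens (in reality, the array needs more elements before the kernel produces the error).
Initial array: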
0 1 2 3 4
5 6 7 8 9
Expected result:
4 0 1 2 3
9 5 6 7 8
Kernel produces:
4 0 1 2 3
8 5 6 7 8
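The first row is shifted correctly (in each block, if there is more than one), but every subsequent row has its second-to-last value repeated. I tested this on two different cards (8600GT and GTX280) and got the same results. Is this just a bug in my kernel, or a problem that cannot be fixed by adjusting my code?
The full source file is included below.
Thanks.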
#include <cstdio>
#include <cstdlib>
// A method to ensure all reads use the same logical layout.
inline __device__ __host__ int loc(int x, int y, int width)
{
    return y*width + x;
}
//kernel to shift all items in a 2D array one position to the right (wrapping around rows)
__global__ void shiftRight ( int* globalArray, int width, int height)
{
    int temp1=0; //temporary swap variables
    int temp2=0;
    int blockRange=0; //the number of rows that a single block will shift
    if (height%gridDim.x==0) //logic to account for awkward array sizes
        blockRange = height/gridDim.x;
    else
        blockRange = (1+height/gridDim.x);
    int yStart = blockIdx.x*blockRange;
    int yEnd = yStart+blockRange; //the end condition for the y-loop
    yEnd = min(height,yEnd); //make sure that the array doesn't go out of bounds
    for (int y = yStart; y < yEnd ; ++y)
    {
        //do the first read so the swap variables are loaded for the x-loop
        temp1 = globalArray[loc(threadIdx.x,y,width)];
        //Each block shifts an entire row by itself, even if there are more columns than threads
        for (int threadXOffset = threadIdx.x ; threadXOffset < width ; threadXOffset+=blockDim.x)
        {
            //blockDim.x is added so that we store the next round of values
            //this has to be done now, because the next operation will
            //overwrite one of these values
            temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
            __syncthreads(); //sync before the write to ensure all the values have been read
            globalArray[loc((threadXOffset +1)%width,y,width)] = temp1;
            __syncthreads(); //sync after the write to ensure all the values have been written
            temp1 = temp2; //swap the storage variables.
        }
        if (threadIdx.x == 0 && y == 0)
            globalArray[loc(12,2,width)]=globalArray[67];
    }
}
int main (int argc, char* argv[])
{
    //set the parameters to be used
    int width = 34;
    int height = 3;
    int threadsPerBlock=33;
    int numBlocks = 1;
    int memSizeInBytes = width*height*sizeof(int);
    //create the host data and assign each element of the array to equal its index
    int* hostData = (int*) malloc (memSizeInBytes);
    for (int y = 0 ; y < height ; ++y)
        for (int x = 0 ; x < width ; ++x)
            hostData [loc(x,y,width)] = loc(x,y,width);
    //create and allocate the device pointers
    int* deviceData;
    cudaMalloc ( &deviceData ,memSizeInBytes);
    cudaMemset ( deviceData,0,memSizeInBytes);
    cudaMemcpy ( deviceData, hostData, memSizeInBytes, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();
    //launch the kernel
    shiftRight<<<numBlocks,threadsPerBlock>>> (deviceData, width, height);
    cudaThreadSynchronize();
    //copy the device data to a host array
    int* hostDeviceOutput = (int*) malloc (memSizeInBytes);
    cudaMemcpy (hostDeviceOutput, deviceData, memSizeInBytes, cudaMemcpyDeviceToHost);
    cudaFree (deviceData);
    //Print out the expected/desired device output
    printf("---- Expected Device Output ----\n");
    printf(" | ");
    for (int x = 0 ; x < width ; ++x)
        printf("%4d ",x);
    printf("\n---|-");
    for (int x = 0 ; x < width ; ++x)
        printf("-----");
    for (int y = 0 ; y < height ; ++y)
    {
        printf("\n%2d | ",y);
        for (int x = 0 ; x < width ; ++x)
            printf("%4d ",hostData[loc((x-1+width)%width,y,width)]);
    }
    printf("\n\n");
    printf("---- Actual Device Output ----\n");
    printf(" | ");
    for (int x = 0 ; x < width ; ++x)
        printf("%4d ",x);
    printf("\n---|-");
    for (int x = 0 ; x < width ; ++x)
        printf("-----");
    for (int y = 0 ; y < height ; ++y)
    {
        printf("\n%2d | ",y);
        for (int x = 0 ; x < width ; ++x)
            printf("%4d ",hostDeviceOutput[loc(x,y,width)]);
    }
    printf("\n\n");
    //release the host buffers
    free (hostData);
    free (hostDeviceOutput);
    return 0;
}
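Answer 0 (score: 1)
From the programming guide:
"__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects."
In my example, not all threads executed the same number of loop iterations, so the threads did not all reach the same barriers and the synchronization did not happen.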
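To make the mismatch concrete, here is a small host-side sketch in plain C++ (not CUDA) that counts how many times each thread would pass through the inner loop of the kernel in the question. The helper names `loopIterations`, `barrierCallsPerThread`, and `barriersUniform` are made up for this illustration, and the sketch only models the loop bounds, not the GPU's actual behavior.

```cpp
#include <cassert>
#include <vector>

// Number of passes one thread makes through the original inner loop:
//   for (x = tid; x < width; x += blockDim) { ... }
int loopIterations(int tid, int blockDim, int width)
{
    assert(blockDim > 0); // a zero stride would never terminate
    int count = 0;
    for (int x = tid; x < width; x += blockDim)
        ++count;
    return count;
}

// __syncthreads() calls each thread would issue for one row
// (the loop body in the question contains two barriers per pass).
std::vector<int> barrierCallsPerThread(int blockDim, int width)
{
    std::vector<int> calls(blockDim);
    for (int tid = 0; tid < blockDim; ++tid)
        calls[tid] = 2 * loopIterations(tid, blockDim, width);
    return calls;
}

// True when every thread of the block reaches the same number of barriers.
bool barriersUniform(int blockDim, int width)
{
    std::vector<int> calls = barrierCallsPerThread(blockDim, width);
    for (int c : calls)
        if (c != calls[0]) return false;
    return true;
}
```

With the parameters from the question (33 threads, width 34), thread 0 makes two passes while threads 1 to 32 make one, so `barriersUniform(33, 34)` is false. When the width is an exact multiple of the block size, every thread makes the same number of passes and the counts match.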
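Answer 1 (score: 1)
Because not all threads execute the same number of loop iterations, synchronization is a problem! All threads in the block should always reach the same __syncthreads() calls.
I would suggest transforming your innermost for loop into something like this: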
for(int blockXOffset=0; blockXOffset < width; blockXOffset+=blockDim.x) {
    int threadXOffset=blockXOffset+threadIdx.x;
    bool isActive=(threadXOffset < width);
    if (isActive) temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
    __syncthreads();
    if (isActive) globalArray[loc((threadXOffset +1)%width,y,width)] = temp1;
    __syncthreads();
    temp1 = temp2;
}
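One way to sanity-check the restructured loop without a GPU is a sequential host-side model in plain C++. This is only a sketch under stated assumptions: each `__syncthreads()` is treated as a phase boundary (all threads finish their reads before any thread writes), the block size is at least 1 and no larger than the row width, and the names `shiftRowSimulated` and `shiftRowReference` are hypothetical helpers, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of the restructured loop for a single row.
// Each "for tid" pass stands in for the work all threads do between barriers.
void shiftRowSimulated(std::vector<int>& row, int blockDim)
{
    const int width = static_cast<int>(row.size());
    assert(blockDim >= 1 && blockDim <= width); // assumption of this sketch

    std::vector<int> temp1(blockDim), temp2(blockDim, 0);

    // Initial read, as done just before the inner loop in the kernel.
    for (int tid = 0; tid < blockDim; ++tid)
        temp1[tid] = row[tid];

    for (int blockXOffset = 0; blockXOffset < width; blockXOffset += blockDim)
    {
        // Read phase: everything before the first __syncthreads().
        for (int tid = 0; tid < blockDim; ++tid)
        {
            int x = blockXOffset + tid;
            if (x < width) temp2[tid] = row[(x + blockDim) % width];
        }
        // Write phase: between the two __syncthreads() calls.
        for (int tid = 0; tid < blockDim; ++tid)
        {
            int x = blockXOffset + tid;
            if (x < width) row[(x + 1) % width] = temp1[tid];
        }
        // After the second barrier every thread rotates its temporaries,
        // whether or not it was active in this pass.
        for (int tid = 0; tid < blockDim; ++tid)
            temp1[tid] = temp2[tid];
    }
}

// Reference result: rotate the row right by one position.
std::vector<int> shiftRowReference(const std::vector<int>& row)
{
    std::vector<int> out(row.size());
    for (std::size_t x = 0; x < row.size(); ++x)
        out[(x + 1) % row.size()] = row[x];
    return out;
}
```

Because the outer loop bound depends only on `width` and `blockDim`, every modeled thread goes through the same number of passes, which is the property the real kernel needs so that all threads reach both barriers. For a 34-element row and a block size of 33, the model produces the same row as `shiftRowReference`.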