Question

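About two years ago I wrote a kernel that works on several numerical grids simultaneously. Some very strange behavior emerged that produced wrong results. When I hunted for the bug using printf() statements inside the kernel, the bug vanished.

Because of deadline constraints I left it that way, although I recently decided that this is not an acceptable coding style. So I revisited my kernel and boiled it down to what you see below.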

    __launch_bounds__(672, 2)
    __global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
                                 int numNodesPerGrid, int numBlocksPerSM,
                                 int numGridsPerSM, int numGrids)
    {
        __syncthreads();
        // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
        int id_sm = blockIdx.x / numBlocksPerSM;
        // Block number on this specific SM - (constant over lifetime of thread)
        int id_blockOnSM = blockIdx.x % numBlocksPerSM;
        // Grid point number this thread is to work upon - (constant over lifetime of thread)
        int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius;
        // Grid ID this thread is to work upon - (not constant over lifetime of thread)
        int id_grid = id_sm * numGridsPerSM;

        while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
        {
            __syncthreads();
            // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
            int id_numInArray = id_grid * numNodesPerGrid + id_r;
            float uchange = 0.0f;
            //uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
            float du = 0.0f;

            if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                if (id_r == 0) // FO-forward difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
                else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
                    du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
                else if (id_r == 1 || id_r == numNodesPerGrid - 2) // SO-central difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
                else if(id_r > 1 && id_r < numNodesPerGrid - 2)
                    du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]))
                       + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
                else
                    du = 0;
            }

            __syncthreads();
            if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
                d_du[ id_numInArray] = du;
            }

            __syncthreads();
            ++id_grid;
        }
    }

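This kernel computes the derivative of some value at all grid points of a number of numerical 1D grids.

Things to consider (see the full code base at the bottom):

• A grid consists of 1300 grid points.

• Each grid has to be worked on by two blocks (because of memory/register limitations).

• Each block successively works on 37 grids (or rather on grid halves; the while loop takes care of that).

• Each thread is responsible for the same grid point in every grid.

• To compute the derivative, a thread needs access to the data of the four neighboring grid points.

• To keep the blocks independent of each other, a small overlap on the grid is introduced (grid points 666, 667, 668 and 669 of each grid are read by two threads from different blocks, though only one thread writes to them; it is at this overlap that the problems occur).

• As a result of the boiling-down process, the two threads on each side of a block do no computation; in the original kernel they are responsible for writing the corresponding grid values to shared memory.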

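The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r). Each grid occupies 1300 consecutive values in each of these arrays. The while loop in the kernel iterates over 37 grids for each block.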
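The index arithmetic described above can be checked on the host. The following is a minimal C++ sketch, not device code: it mirrors the kernel's expression for id_r and its guard condition (without the id_grid < numGrids part). The helper names gridPoint, isActive, writerBlock and readByOtherBlock are made up for illustration, and the sketch assumes that every reader uses the full interior stencil, ignoring the narrower stencils at the two ends of a grid.

```cpp
#include <cassert>

// Host-side model of the kernel's index arithmetic (illustrative, not device code).

// Grid point (id_r) handled by thread `tid` of block `blockOnSM`.
int gridPoint(int blockOnSM, int tid, int blockDim, int radius) {
    return blockOnSM * (blockDim - 2 * radius) + tid - radius;
}

// True if the thread passes the kernel's guard, i.e. it computes and stores.
bool isActive(int blockOnSM, int tid, int blockDim, int radius, int numNodes) {
    assert(radius >= 0 && blockDim > 2 * radius);
    int r = gridPoint(blockOnSM, tid, blockDim, radius);
    return tid > radius - 1 && tid < blockDim - radius && r < numNodes;
}

// Block that writes grid point r: each block owns blockDim - 2*radius points.
int writerBlock(int r, int blockDim, int radius) {
    return r / (blockDim - 2 * radius);
}

// True if point r is read by a thread of a block other than the one writing it.
// Assumption: every reader q uses the interior stencil q-radius .. q+radius.
bool readByOtherBlock(int r, int blockDim, int radius, int numNodes) {
    for (int q = r - radius; q <= r + radius; ++q) {
        if (q < radius || q >= numNodes - radius) continue;  // skip boundary readers
        if (writerBlock(q, blockDim, radius) != writerBlock(r, blockDim, radius))
            return true;
    }
    return false;
}
```

With blockDim = 672, radius = 2 and 1300 nodes, block 0 stores points 0 to 667 and block 1 stores points 668 to 1299, and readByOtherBlock returns true exactly for points 666 to 669, which is the overlap mentioned in the list.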

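To evaluate the workings of the kernel, each grid is initialized with exactly the same values, so a deterministic program will produce the same result for every grid. This does not happen with my code.

The weirdness of the Heisenbug:

I compared the computed values of grid 0 with those of every other grid, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous. The first thing that came to mind was that two threads at this overlap try to write to memory concurrently, though that does not seem to be the case (I checked... and re-checked).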

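Commenting or uncommenting lines, or using printf() for debugging, also alters the outcome of the program: when "asking" the threads responsible for the grid points in question, they tell me that everything is fine, and they are actually right. As soon as I force a thread to print out its variables, they are computed (and, more importantly, stored) correctly. The same goes for debugging with Nsight Eclipse.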

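Memcheck / Racecheck:

cuda-memcheck (memcheck and racecheck) reports no memory or race condition problems, though even using one of these tools can affect the correctness of the results. Valgrind gives some warnings, but I think they relate to the CUDA API, which I cannot influence, and they seem unrelated to my problem.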

(Update) As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand is a race condition on d_u, i.e., in global memory.

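Testing environment:

The original kernel has been tested on different CUDA devices with different compute capabilities (2.0, 3.0 and 3.5), and the bug shows up in every configuration (in one form or another).

My (main) testing system is the following:

• 2 x GTX 460, tested on both the GPU that runs the X server and the other one

• Driver version: 340.46

• CUDA Toolkit 6.5

• Linux kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)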

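State of the solution:

By now I am pretty sure that some memory access is the culprit, perhaps some compiler optimization or the use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm. The fact that printf() statements inside the kernel (which through some dark magic must use both device and host memory) and the memcheck tools (cuda-memcheck and valgrind) influence the behavior points in the same direction.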

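I am sorry for this somewhat elaborate kernel, but I boiled the original kernel and its invocation down as far as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.

Two "solutions" that force the kernel to work as intended are marked in the code.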

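(Update) As mentioned in the correct answer below, the problem with my code is a race condition at the border of the thread blocks. Two blocks work on each grid and there is no guarantee as to which block runs first, which leads to the behavior described here. It also explains the correct results when employing "Solution 1" as marked in the code, because the input/output value d_u does not change when uchange = 1.0.

The simple solution is to split this kernel into two kernels, one calculating d_u and the other calculating the derivative d_du. It would be more desirable to have just one kernel call instead of two, although I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, although the overhead of the second kernel call is negligible.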
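Why the two-kernel split removes the order dependence can be sketched on the host. The following C++ model is only an illustration under stated assumptions, not the actual device code: each "block" is a plain function call, the grid has uniform spacing 1 so the d_r differences become the constants 2 and 4, the weights 0.2 and 0.8 follow d_fourpoint_constant, and boundary points are simply left at 0. The names derivativePhase, updatePhase and twoPhase are hypothetical. The derivative phase is placed first because the posted kernel computes du from the values of d_u before they are overwritten.

```cpp
#include <cassert>
#include <vector>

// Phase 1 (stands in for the first kernel launch): reads u, writes only du.
void derivativePhase(const std::vector<float>& u, std::vector<float>& du,
                     int first, int last) {
    const int n = static_cast<int>(u.size());
    assert(0 <= first && first <= last && last <= n);
    for (int r = first; r < last; ++r) {
        if (r > 1 && r < n - 2)  // interior points only; edges stay 0 in this sketch
            du[r] = 0.2f * (u[r + 1] - u[r - 1]) / 2.0f
                  + 0.8f * (u[r + 2] - u[r - 2]) / 4.0f;
    }
}

// Phase 2 (stands in for the second kernel launch): in-place update of u.
void updatePhase(std::vector<float>& u, int first, int last, float uchange) {
    for (int r = first; r < last; ++r) u[r] *= uchange;
}

// Runs both phases; within each phase the two "blocks" run in the given order.
std::vector<float> twoPhase(int n, int split, bool block0First, float uchange) {
    std::vector<float> u(n), du(n, 0.0f);
    for (int i = 0; i < n; ++i) u[i] = static_cast<float>(i + 1);
    const int lo[2] = {0, split}, hi[2] = {split, n};
    const int a = block0First ? 0 : 1, b = 1 - a;
    derivativePhase(u, du, lo[a], hi[a]);
    derivativePhase(u, du, lo[b], hi[b]);
    // The boundary between the two launches acts as a barrier across all blocks.
    updatePhase(u, lo[a], hi[a], uchange);
    updatePhase(u, lo[b], hi[b], uchange);
    return du;
}
```

In this model twoPhase(16, 8, true, 0.0f) and twoPhase(16, 8, false, 0.0f) return the same du, because no write to u happens until every read of the first phase has finished. On the device, two kernels launched into the same stream run one after the other, which provides the same separation.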

heisenbug.cu:

    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    const float r_sol = 6.955E8f;
    __constant__ float d_fourpoint_constant = 0.2f;

    __launch_bounds__(672, 2)
    __global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
                                 int numNodesPerGrid, int numBlocksPerSM,
                                 int numGridsPerSM, int numGrids)
    {
        __syncthreads();
        // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
        int id_sm = blockIdx.x / numBlocksPerSM;
        // Block number on this specific SM - (constant over lifetime of thread)
        int id_blockOnSM = blockIdx.x % numBlocksPerSM;
        // Grid point number this thread is to work upon - (constant over lifetime of thread)
        int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius;
        // Grid ID this thread is to work upon - (not constant over lifetime of thread)
        int id_grid = id_sm * numGridsPerSM;

        while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
        {
            __syncthreads();
            // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
            int id_numInArray = id_grid * numNodesPerGrid + id_r;
            float uchange = 0.0f;
            //uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
            float du = 0.0f;

            if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                if (id_r == 0) // FO-forward difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
                else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
                    du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
                else if (id_r == 1 || id_r == numNodesPerGrid - 2) // SO-central difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
                else if(id_r > 1 && id_r < numNodesPerGrid - 2)
                    du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]))
                       + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
                else
                    du = 0;
            }

            __syncthreads();
            if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
                d_du[ id_numInArray] = du;
            }

            __syncthreads();
            ++id_grid;
        }
    }

    bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid)
    {
        bool retval = true;
        for(uint i=0; i<numNodesPerGrid; ++i)
            if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
            {
                printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
                       label, i, id0, id1,
                       matarray[id0 * numNodesPerGrid + i],
                       matarray[id1 * numNodesPerGrid + i],
                       matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
                retval = false;
            }
        return retval;
    }

    int main(int argc, const char* argv[])
    {
        float *d_u;
        float *d_du;
        float *d_r;

        float *u_arr;
        float *du_arr;
        float *r_arr;

        int numNodesPerGrid = 1300;
        int numBlocksPerSM = 2;
        int numGridsPerSM = 37;
        int numSM = 7;
        int TPB = 672;
        int radius = 2;
        int numGrids = 259;
        int memsize_grid = sizeof(float) * numNodesPerGrid;

        int numBlocksPerGrid = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);

        printf("---------------------------------------------------------------------------\n");
        printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
        printf("---------------------------------------------------------------------------\n\n");

        cudaSetDevice(0);
        cudaDeviceReset();

        cudaMalloc((void **) &d_u, memsize_grid * numGrids);
        cudaMalloc((void **) &d_du, memsize_grid * numGrids);
        cudaMalloc((void **) &d_r, memsize_grid * numGrids);

        u_arr = new float[numGrids * numNodesPerGrid];
        du_arr = new float[numGrids * numNodesPerGrid];
        r_arr = new float[numGrids * numNodesPerGrid];

        // Every grid is initialized with exactly the same values.
        for(uint k=0; k<numGrids; ++k)
            for(uint i=0; i<numNodesPerGrid; ++i)
            {
                uint index = k * numNodesPerGrid + i;

                if (i < 585)
                    r_arr[index] = i * (6000.0f);
                else
                {
                    if (i == 585)
                        r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
                    else
                        r_arr[index] = r_arr[index - 1] + 1.02102f * ( r_arr[index - 1] - r_arr[index - 2] );
                }

                u_arr[index] = 1E-10f * (i+1);
                du_arr[index] = 0.0f;
            }

        /*
        printf("\n\nbefore kernel start\n\n");
        for(uint k=0; k<numGrids; ++k)
            printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/

        bool equal = true;
        for(int k=1; k<numGrids; ++k)
        {
            equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
            equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
            equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
        }

        if(!equal)
            printf("Input values are not identical for different grids!\n\n");
        else
            printf("All grids contain the same values at same grid points.!\n\n");

        cudaMemcpy(d_u, u_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
        cudaMemcpy(d_du, du_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
        cudaMemcpy(d_r, r_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);

        printf("Configuration:\n\n");
        printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
        printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
        printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
        printf("Kernel launch parameters:\n\n");
        printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
        printf("Launching Kernel...\n\n");

        heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
        cudaDeviceSynchronize();

        cudaMemcpy(u_arr, d_u, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
        cudaMemcpy(du_arr, d_du, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
        cudaMemcpy(r_arr, d_r, memsize_grid * numGrids, cudaMemcpyDeviceToHost);

        /*
        printf("\n\nafter kernel finished\n\n");
        for(uint k=0; k<numGrids; ++k)
            printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/

        equal = true;
        for(int k=1; k<numGrids; ++k)
        {
            equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
            equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
            equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
        }

        if(!equal)
            printf("Results are wrong!!\n");
        else
            printf("All went well!\n");

        cudaFree(d_u);
        cudaFree(d_du);
        cudaFree(d_r);

        delete [] u_arr;
        delete [] du_arr;
        delete [] r_arr;

        return 0;
    }

Answer 1

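Note that in the entirety of your posting I do not see a question being explicitly asked, so I am responding to:

I am looking forward to learning what is going on here.

You have a race condition on d_u.

By your own statement:

• In order to keep the blocks independent of each other, a small overlap on the grid is introduced (grid points 666, 667, 668 and 669 of each grid are read by two threads from different blocks, though only one thread writes to them; it is at this overlap that the problems occur)

Furthermore, if you comment out the write to d_u, then according to your own comment in the code the problem disappears.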

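CUDA thread blocks can execute in any order. You have at least two different blocks reading from grid points 666, 667, 668 and 669. The results will differ depending on which case actually occurs:

• Both blocks read the value before any write occurs.
• One block reads the value, then a write occurs, then the other block reads the value.

The blocks are not independent of each other (contrary to your statement) if one block reads a value that can be written by another block. In that case the order of block execution determines the result, and CUDA does not specify the order of block execution.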
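The dependence on block order can be made visible with a small host-side model. This is a C++ sketch under simplifying assumptions, not the posted kernel: each "block" is a function call that first reads and then writes, the two blocks run strictly one after the other, the grid has 16 points with uniform spacing 1 and is split at point 8, and boundary points are left at 0. The names Grid, runBlock and simulate are invented for this illustration.

```cpp
#include <cassert>
#include <vector>

struct Grid { std::vector<float> u, du; };

// One "block" owning points [first, last): it reads u for its stencil first and
// stores afterwards, like the kernel's compute phase followed by its write phase.
void runBlock(Grid& g, int first, int last, float uchange) {
    const int n = static_cast<int>(g.u.size());
    assert(0 <= first && first <= last && last <= n);
    std::vector<float> du(last - first, 0.0f);
    for (int r = first; r < last; ++r) {
        if (r > 1 && r < n - 2)  // interior stencil only in this sketch
            du[r - first] = 0.2f * (g.u[r + 1] - g.u[r - 1]) / 2.0f
                          + 0.8f * (g.u[r + 2] - g.u[r - 2]) / 4.0f;
    }
    for (int r = first; r < last; ++r) {
        g.u[r] *= uchange;        // in-place update that the other block may read
        g.du[r] = du[r - first];
    }
}

// Runs the two blocks of one grid in the given order and returns du.
std::vector<float> simulate(int n, int split, bool block0First, float uchange) {
    Grid g;
    g.u.resize(n);
    g.du.assign(n, 0.0f);
    for (int i = 0; i < n; ++i) g.u[i] = static_cast<float>(i + 1);
    if (block0First) {
        runBlock(g, 0, split, uchange);
        runBlock(g, split, n, uchange);
    } else {
        runBlock(g, split, n, uchange);
        runBlock(g, 0, split, uchange);
    }
    return g.du;
}
```

With uchange = 0.0f the two orders give different du at points 6 to 9, the analogue of points 666 to 669, while all other points agree. With uchange = 1.0f the write leaves u unchanged and both orders agree, which matches "Solution 1" in the question.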

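Note that cuda-memcheck with the --tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, so I would not expect cuda-memcheck to report anything.

cuda-memcheck, in order to gather its data, does influence the order of block execution, so it is not surprising that it affects the behavior.

An in-kernel printf is a costly function call that writes to a global memory buffer, so it also affects execution behavior and patterns. If you print a large amount of data, exceeding the buffer's capacity in lines of output, the cost in execution time becomes very high once the buffer overflows.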

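As an aside, as far as I can tell Linux Mint is not a supported distro for CUDA. However, I do not think this is relevant to your issue; I can reproduce the behavior on a supported configuration.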

Heisenbug in CUDA kernel, global memory access
