Question

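I am currently working on a larger project involving CUDA. Over the last few days I ran into an error that I have been desperately trying to fix. I could not figure it out, so I have now built a minimal example that shows the same behaviour. I should say that I am fairly new to CUDA. I am using Visual Studio 2015 and CUDA Toolkit 7.5.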

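The program creates a 3D volume in GPU memory, then computes values and writes them into the volume. I tried to keep the code as simple as possible: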

First, the main.cpp file:

#include "cuda_test.h"

int main() {

    size_t const xDimension = 500;
    size_t const yDimension = 500;
    size_t const zDimension = 1000;

    //allocate volume part memory on gpu
    cudaPitchedPtr volume = ct::cuda::create3dVolumeOnGPU(xDimension, yDimension, zDimension);

    //start reconstruction
    ct::cuda::startReconstruction(volume,
                                  xDimension,
                                  yDimension,
                                  zDimension);

    return 0;

}

Then cuda_test.h, the header for the actual .cu file:

#ifndef CT_CUDA
#define CT_CUDA

#include <cstdlib>
#include <stdio.h>
#include <cmath>

//CUDA
#include <cuda_runtime.h>

namespace ct {

    namespace cuda {

        cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize);
        void startReconstruction(cudaPitchedPtr volume,
                                 size_t xSize,
                                 size_t ySize,
                                 size_t zSize);

    }

}

#endif

Then the cuda_test.cu file, which contains the actual function implementations:

#include "cuda_test.h"

namespace ct {

    namespace cuda {

        cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize) {
            cudaExtent extent = make_cudaExtent(xSize * sizeof(float), ySize, zSize);
            cudaPitchedPtr ptr;
            cudaMalloc3D(&ptr, extent);
            printf("malloc3D: %s\n", cudaGetErrorString(cudaGetLastError()));
            cudaMemset3D(ptr, 0, extent);
            printf("memset: %s\n", cudaGetErrorString(cudaGetLastError()));
            return ptr;
        }

        __device__ void addToVolumeElement(cudaPitchedPtr volumePtr, size_t ySize, size_t xCoord, size_t yCoord, size_t zCoord, float value) {
            char* devicePtr = (char*)(volumePtr.ptr);
            //z * xSize * ySize + y * xSize + x
            size_t pitch = volumePtr.pitch;
            size_t slicePitch = pitch * ySize;
            char* slice = devicePtr + zCoord*slicePitch;
            float* row = (float*)(slice + yCoord * pitch);
            row[xCoord] += value;
        }

        __global__ void reconstructionKernel(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {

            size_t xIndex = blockIdx.x;
            size_t yIndex = blockIdx.y;
            size_t zIndex = blockIdx.z;

            if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
                printf("kernel start\n");
            }

            //just make sure we're inside the volume bounds
            if (xIndex < xSize && yIndex < ySize && zIndex < zSize) {

                //float value = z;
                float value = sqrt(sqrt(sqrt(5.3))) * sqrt(sqrt(sqrt(1.2))) * sqrt(sqrt(sqrt(10.8))) + 501 * 0.125 * 0.786 / 5.3;

                addToVolumeElement(volumePtr, ySize, xIndex, yIndex, zIndex, value);

            }

            if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
                printf("kernel end\n");
            }

        }

        void startReconstruction(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {
            dim3 blocks(xSize, ySize, zSize);
            reconstructionKernel <<< blocks, 1 >>>(volumePtr,
                                                   xSize,
                                                   ySize,
                                                   zSize);
            printf("Kernel launch: %s\n", cudaGetErrorString(cudaGetLastError()));
            cudaDeviceSynchronize();
            printf("Device synchronise: %s\n", cudaGetErrorString(cudaGetLastError()));
        }

    }

}

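The function create3dVolumeOnGPU allocates a 3-dimensional "volume" in GPU memory and returns a pointer to it. It is a host function. The second host function is startReconstruction. The only thing it does is launch the actual kernel, with as many blocks as there are voxels in the volume. The kernel function is reconstructionKernel. It just computes an arbitrary value from some constants and then calls addToVolumeElement (a device function) to add the result to the corresponding voxel.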
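The address arithmetic inside addToVolumeElement can be checked on the host without a GPU. The following is a minimal sketch, not part of the original code: pitchedByteOffset is a hypothetical helper, and the pitch value used in the note below is an assumption, since the real pitch is whatever cudaMalloc3D returns for the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset of voxel (x, y, z) in a pitched 3D allocation.
// pitchBytes is the padded row width in bytes (cudaPitchedPtr::pitch),
// ySize is the number of rows per z-slice.
std::size_t pitchedByteOffset(std::size_t pitchBytes, std::size_t ySize,
                              std::size_t x, std::size_t y, std::size_t z) {
    assert(x * sizeof(float) < pitchBytes);  // x must lie inside the padded row
    assert(y < ySize);                       // y must lie inside the slice
    std::size_t slicePitch = pitchBytes * ySize;  // bytes per z-slice
    return z * slicePitch + y * pitchBytes + x * sizeof(float);
}
```

With an assumed pitch of 2048 bytes and ySize = 500, voxel (3, 2, 1) sits 1 * 2048 * 500 + 2 * 2048 + 3 * 4 = 1028108 bytes past the base pointer, which is the same slice/row/element walk the device function performs.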

Now, the problem is that it crashes. If I launch it under the debugger (Nsight), Nsight breaks with the error message:

CUDA grid launch failed: CUcontext: 2358451327088 CUmodule: 2358541519888 Function: _ZN2ct4cuda20reconstructionKernelE14cudaPitchedPtryyy

Console output:

malloc3D: no error
memset: no error
kernel started
kernel end

If I launch it in release mode, the whole machine resets.

However, if I change the volume dimensions to something smaller, it works, for example:

    size_t const xDimension = 100;
    size_t const yDimension = 100;
    size_t const zDimension = 100;

The amount of free GPU memory should not be the problem, though (the card has 4 GB of VRAM).

It would be great if someone could take a look and maybe give me a hint as to what might be causing the problem.

Answer 1

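"Now, the problem is that it crashes."

"It would be great if someone could take a look and maybe give me a hint as to what might be causing the problem."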

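I think you may be running into a WDDM TDR issue. On Windows, any time a kernel running on a WDDM GPU takes longer than about 2 seconds to execute, you can hit the WDDM TDR watchdog (assuming you have not made any changes to the watchdog).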

Furthermore, launching a kernel like this:

reconstructionKernel <<< blocks, 1 >>>(...);

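where the number of threads per block is 1, means that only one thread is active in each warp (and in each block). But the GPU likes to have 32 active threads per warp. The net effect is inefficient use of GPU resources; when you run a kernel this way, as much as 97% of the GPU's horsepower may sit idle.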
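The 97% figure follows from simple arithmetic on warp lanes. Here is a small host-side sketch of that calculation; activeLaneFraction is a hypothetical helper, and the warp size of 32 is taken from the answer's statement above.

```cpp
#include <cassert>

// Hypothetical helper: fraction of warp lanes doing useful work for a given
// block size. A block always occupies whole warps, so a partially filled
// warp leaves its remaining lanes idle.
double activeLaneFraction(unsigned threadsPerBlock, unsigned warpSize = 32) {
    assert(threadsPerBlock > 0 && warpSize > 0);
    unsigned warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;  // round up
    return static_cast<double>(threadsPerBlock) / (warpsPerBlock * warpSize);
}
```

For one thread per block this gives 1/32 = 0.03125, so roughly 97% of the lanes do nothing; for 256 threads per block it gives 1.0.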

So, if your code is flexible enough to allow either:

reconstructionKernel <<< blocks, 1 >>>(...);

or, equivalently:

reconstructionKernel <<< blocks/256, 256 >>>(...);

(This is just a representative example; I realize you have a multidimensional grid, and the above may not map exactly onto your case.)

then the second launch method will almost certainly be more efficient, giving a shorter execution time for the same work.
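For the multidimensional grid in the question, the "blocks/256, 256" idea could be carried over per dimension with a rounded-up division. The sketch below is one possible way to size the grid, not the asker's code: Dim3 is a stand-in for CUDA's dim3 so that it compiles without the toolkit, gridFor and ceilDiv are hypothetical helpers, and the 8 x 8 x 4 block shape is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CUDA's dim3, so the sketch builds with a plain C++ compiler.
struct Dim3 { unsigned x, y, z; };

// Rounded-up integer division: smallest count of d-sized pieces covering n.
unsigned ceilDiv(std::size_t n, unsigned d) {
    assert(d > 0);
    return static_cast<unsigned>((n + d - 1) / d);
}

// Grid dimensions so that grid * block covers the whole volume.
Dim3 gridFor(std::size_t xSize, std::size_t ySize, std::size_t zSize, Dim3 block) {
    return Dim3{ceilDiv(xSize, block.x), ceilDiv(ySize, block.y), ceilDiv(zSize, block.z)};
}
```

For the 500 x 500 x 1000 volume and an 8 x 8 x 4 block (256 threads), this gives a 63 x 63 x 250 grid. The kernel would then have to derive each coordinate as blockIdx * blockDim + threadIdx instead of blockIdx alone; the bounds check already present in reconstructionKernel would discard the threads that fall outside the volume.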

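So I believe that when you tested your code with multiple threads per block, you did something like the above, and it brought the execution time below the TDR limit.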

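That is a perfectly good solution, but if you end up adding more work to the kernel (more total threads, or more work per thread), you may run into the limit again. In that case, the linked article explains possible workarounds.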
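Apart from whatever the linked article recommends, one commonly used way to keep each launch short is to split the volume into z-slabs and launch the kernel once per slab. The sketch below only computes the slab ranges on the host; zSlabs is a hypothetical helper, and the slab depth is an assumption that would need tuning, since whether a single slab finishes within the watchdog limit depends on the GPU and on the work per voxel.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, zSize) into consecutive half-open ranges
// [begin, end) of at most slabDepth slices each.
std::vector<std::pair<std::size_t, std::size_t>> zSlabs(std::size_t zSize,
                                                        std::size_t slabDepth) {
    assert(slabDepth > 0);
    std::vector<std::pair<std::size_t, std::size_t>> slabs;
    for (std::size_t begin = 0; begin < zSize; begin += slabDepth) {
        // The last slab may be shorter than slabDepth.
        std::size_t end = (begin + slabDepth < zSize) ? begin + slabDepth : zSize;
        slabs.emplace_back(begin, end);
    }
    return slabs;
}
```

For zSize = 1000 and an assumed depth of 300 this yields four slabs: [0, 300), [300, 600), [600, 900) and [900, 1000). Each launch would then take the slab's begin value as a z offset and cover only end - begin slices, with a synchronization between launches.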

As an aside, kernel launch configurations like this:

kernel<<<1, ?>>>(...);

or this:

kernel<<<?, 1>>>(...);

are never the right choice for high-performance code on the GPU.

C++: Simple CUDA volume reconstruction code crashes