Question

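EDIT: I figured it out while proofreading this question.

The root of the problem is most likely that I am not allocating enough memory. I will think it through, fix it properly, and then answer my own question. Silly me. :-[ That still does not explain the warps that never show up in stdout, though...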

Original question

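I have written a templated kernel in CUDA in which I iterate over sections of grayscale image data in global memory (shared-memory optimizations will come once I get this working) to implement morphological operations with disk-shaped structuring elements. Each thread corresponds to one pixel of the image. When the data type is char, everything works as expected and all my threads do what they should. When I change it to unsigned short, it starts acting up and only computes the upper half of the image. When I put in some printfs (my device has compute capability 2.0), I found that some of the warps that should run are not even computed.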

Here is the relevant code.

In my main.cpp I call gcuda::ErodeGpuGray8(img, radius); and gcuda::ErodeGpuGray16(img, radius);, which are the following functions:

// gcuda.h
…
i3d::Image3d<i3d::GRAY8> ErodeGpuGray8(i3d::Image3d<i3d::GRAY8> img, const unsigned int radius);
i3d::Image3d<i3d::GRAY16> ErodeGpuGray16(i3d::Image3d<i3d::GRAY16> img, const unsigned int radius);
…

// gcuda.cu
…
// call this from outside
Image3d<GRAY8> ErodeGpuGray8(Image3d<GRAY8> img, const unsigned int radius) {
    return ErodeGpu<GRAY8>(img, radius);
}

// call this from outside
Image3d<GRAY16> ErodeGpuGray16(Image3d<GRAY16> img, const unsigned int radius) {
    return ErodeGpu<GRAY16>(img, radius);
}
…

The library I am using defines GRAY8 as char and GRAY16 as unsigned short.

Here is how I launch the kernel (blockSize is a const int set to 128 in the relevant namespace):

// gcuda.cu
template<typename T> Image3d<T> ErodeGpu(Image3d<T> img, const unsigned int radius) {
    unsigned int width = img.GetWidth();
    unsigned int height = img.GetHeight();
    unsigned int w = nextHighestPower2(width);
    unsigned int h = nextHighestPower2(height);
    const size_t n = width * height;
    const size_t N = w * h;

    Image3d<T>* rslt = new Image3d<T>(img);
    T *vx = rslt->GetFirstVoxelAddr();

    // kernel parameters
    dim3 dimBlock( blockSize );
    dim3 dimGrid( ceil( N / (float)blockSize) );

    // source voxel array on device (orig)
    T *vx_d;

    // result voxel array on device (for result of erosion)
    T *vxr1_d;

    // allocate memory on device
    gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
    gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );

    gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
    gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );

    ErodeGpu<T><<<dimGrid, dimBlock>>>(vx_d, vxr1_d, n, width, radius);

    gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );

    // free device memory
    gpuErrchk( cudaFree( vx_d ) );
    gpuErrchk( cudaFree( vxr1_d ) );

    // for debug purposes
    rslt->SaveImage("../erodegpu.png");
    return rslt;
}

My test image is 82x82, so n = 82 * 82 = 6724 and N = 128 * 128 = 16384.
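For reference, the launch arithmetic above can be reproduced on the host. This is only a sketch: the body of nextHighestPower2 is not shown in the question, so the version below is a hypothetical stand-in, and blocksFor and warpsFor are illustrative helper names that do not appear in the original code.

```cpp
#include <cstddef>

// Hypothetical stand-in for the question's nextHighestPower2 (its real body is not shown).
// Returns the smallest power of two that is >= v; assumes v <= 2^31.
constexpr unsigned int nextHighestPower2(unsigned int v) {
    unsigned int p = 1;
    while (p < v) p <<= 1;
    return p;
}

// Illustrative helper: blocks needed to cover `total` threads (integer ceiling division).
constexpr std::size_t blocksFor(std::size_t total, std::size_t blockSize) {
    return (total + blockSize - 1) / blockSize;
}

// Illustrative helper: number of 32-thread warps launched by `blocks` blocks.
constexpr std::size_t warpsFor(std::size_t blocks, std::size_t blockSize) {
    return blocks * ((blockSize + 31) / 32);
}

// For the 82x82 test image with blockSize = 128:
//   nextHighestPower2(82) == 128, so N = 128 * 128 = 16384
//   blocksFor(16384, 128) == 128 blocks
//   warpsFor(128, 128)    == 512 warps
```

With these numbers, 6724 threads should take the {X} branch and the remaining 16384 - 6724 = 9660 threads should take the out-of-range branch.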

Here is my kernel:

// gcuda.cu
// CUDA Kernel -- used for image erosion with a circular structure element of radius "erosionR"
template<typename T> __global__ void ErodeGpu(const T *in, T *out, const unsigned int n, const int width, const int erosionR)
{
    ErodeOrDilateCore<T>(ERODE, in, out, n, width, erosionR);
}

// The core of erosion or dilation. Operation is determined by the first parameter
template<typename T> __device__ void ErodeOrDilateCore(operation_t operation, const T *in, T *out, const unsigned int n, const int width, const int radius) {
    // get thread number, this method is overkill for my purposes but generally should be bulletproof, right?
    int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
    int tx = threadId;

    if (tx >= n) {
        printf("[%d > %d]", tx, n);
        return;
    } else {
        printf("{%d}", tx);
    }

    // … (erosion implementation; stdout is the same when this is commented out, so it's probably not the root of the problem)
}
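The general 3D index flattening used in the kernel can be checked on the host with plain integers. The sketch below assumes a hypothetical Dim3 struct that only mirrors the x/y/z fields of CUDA's built-in index types, and flatThreadId is an illustrative name; neither is part of the original code.

```cpp
// Hypothetical host-only struct mirroring the x/y/z fields of CUDA's dim3/uint3.
struct Dim3 {
    unsigned int x, y, z;
};

// Same formula as in ErodeOrDilateCore: flatten the block index first,
// then add the thread's offset inside its block.
constexpr unsigned int flatThreadId(Dim3 bIdx, Dim3 gDim, Dim3 tIdx, Dim3 bDim) {
    return (bIdx.x + bIdx.y * gDim.x + gDim.x * gDim.y * bIdx.z)  // flat block id
               * (bDim.x * bDim.y * bDim.z)                       // threads per block
           + tIdx.z * (bDim.x * bDim.y)
           + tIdx.y * bDim.x
           + tIdx.x;
}

// For the 1D launch in the question (128 blocks of 128 threads):
//   block 52, thread 67    -> 52 * 128 + 67   = 6723  (the last valid pixel, n - 1)
//   block 127, thread 127  -> 127 * 128 + 127 = 16383 (the last launched thread, N - 1)
```

For a 1D grid and a 1D block, the formula reduces to blockIdx.x * blockDim.x + threadIdx.x, so the longer form is harmless here.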

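As I understand it, this code should write a randomly ordered set of [X > n] and {X} strings to stdout, where X is the thread ID. There should be n curly-bracketed numbers (i.e. the output of the threads with index < n) and N - n of the others. But when I run it and count the curly-bracketed numbers with a regex, I find that I only get 256 of them. Furthermore, they seem to occur in groups of 32, which tells me that some warps run and others do not.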

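I am really baffled by this. It does not help that when I leave the erosion implementation uncommented, the GRAY8 erosion works and the GRAY16 erosion does not, even though the stdout output is exactly the same in both cases (this could be input-dependent; I have only tried it with 2 images).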

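What am I missing? What could be causing this? Is there some memory-management mistake on my side, or is it fine that some warps do not run, and the erosion problem is possibly just a bug in the image library that only occurs with the GRAY16 type?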

Answer 1

So it was just a stupid malloc mistake.

Instead of

const size_t n = width * height;
const size_t N = w * h;

I now use

const int n = width * height;
const int N = w * h;

and instead of the wrong

gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );

gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );

…

gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );

I now use

gpuErrchk( cudaMalloc( (void**)&vx_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n * sizeof(T), cudaMemcpyHostToDevice ) );

gpuErrchk( cudaMalloc( (void**)&vxr1_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n * sizeof(T), cudaMemcpyDeviceToDevice ) );

…

gpuErrchk( cudaMemcpy( vx, vxr1_d, n * sizeof(T), cudaMemcpyDeviceToHost ) );

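Now the erosion works correctly, which was the main problem I was trying to solve. I still do not get the stdout output I was expecting, so if anyone can shed some light on that, please do.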
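To make the size mismatch concrete: cudaMalloc and cudaMemcpy take their sizes in bytes, not in elements. The host-only sketch below works through the numbers for the 82x82 test image. bufferBytes and elementsCovered are hypothetical helper names, and the figures assume sizeof(unsigned short) == 2, which holds on common platforms but is not guaranteed by the language.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed for `elementCount` elements of type T.
// This is the value to pass to cudaMalloc / cudaMemcpy.
template <typename T>
constexpr std::size_t bufferBytes(std::size_t elementCount) {
    return elementCount * sizeof(T);
}

// Hypothetical helper: how many whole elements of type T fit into `bytes` bytes.
template <typename T>
constexpr std::size_t elementsCovered(std::size_t bytes) {
    return bytes / sizeof(T);
}

// For n = 82 * 82 = 6724 pixels:
//   bufferBytes<char>(6724)              == 6724   (passing n happens to be right for GRAY8)
//   bufferBytes<unsigned short>(6724)    == 13448  (GRAY16 needs twice as many bytes)
//   elementsCovered<unsigned short>(6724) == 3362  (= 41 rows of 82 pixels)
```

Under that assumption, passing n as the byte count covers only the first 41 of the 82 rows for GRAY16, which is consistent with only the upper half of the image being computed.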

CUDA - simple code, but some of my warps don't run
