Question

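I want to send a 3D array src, of size size in each dimension and flattened into a 1D array of size length = size * size * size, into a kernel, compute a result, and store it in dst. However, at the end, dst incorrectly contains all 0s. Here is my code: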

int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);

// Allocate source and destination arrays on the host and initialize source array

float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
    src[i] = i;
}

// Allocate source and destination arrays on the device

struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);

// Copy to the device, execute kernel, and copy back to the host

cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<<numBlocks, blockSize>>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);

For clarity, I have left out the error checking for cudaMallocHost(), cudaMalloc() and cudaMemcpy(). In any case, this code does not trigger any error.

What is the correct usage of cudaMalloc3D() with cudaMemcpy()?

Please let me know if I should also post a minimal test case for the kernel, or if the problem can be found in the code above.

Answer 1

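Edit: the extent takes the number of elements if a CUDA array is used, but effectively takes the number of bytes if no CUDA array is used (e.g. memory allocated with some non-array variant of cudaMalloc).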

From the Runtime API CUDA documentation:

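The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.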

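Also, cudaMalloc3D returns a pitched pointer, which means the allocation has at least the dimensions of the extent you provided, but possibly more for alignment reasons. You must take this pitch into account when accessing device memory and when copying to and from it. See here for the documentation on the cudaPitchedPtr struct.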
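To make the pitch concrete, here is a minimal sketch of how an element of a pitched allocation of floats can be addressed. The helper name pitchedElement is hypothetical (it is not part of the CUDA API), and the sketch assumes the ysize field holds the allocation height, which is what cudaMalloc3D fills in.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not part of the CUDA API): returns the address of
// element (x, y, z) in a pitched allocation of floats.
__host__ __device__ inline float *pitchedElement(cudaPitchedPtr p,
                                                 size_t x, size_t y, size_t z)
{
    char *base = static_cast<char *>(p.ptr);
    size_t slicePitch = p.pitch * p.ysize;            // bytes per z-slice
    char *row = base + z * slicePitch + y * p.pitch;  // start of row (y, z)
    return reinterpret_cast<float *>(row) + x;        // x counts elements
}
```

Because the row stride is p.pitch bytes rather than size * sizeof(float), a flat index such as z * size * size + y * size + x is only correct when the pitch happens to equal the row width.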

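As for using cudaMalloc3D with cudaMemcpy, you might want to look at using cudaMemcpy3D instead (documentation here), which may make your life easier because it takes the pitch of both the host and the device memory into account. To use cudaMemcpy3D, you have to create a cudaMemcpy3DParms struct filled with the appropriate information. Its members are: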

cudaArray_t dstArray
struct cudaPos dstPos
struct cudaPitchedPtr dstPtr
struct cudaExtent extent
enum cudaMemcpyKind kind
cudaArray_t srcArray
struct cudaPos srcPos
struct cudaPitchedPtr srcPtr

You must specify one of srcArray or srcPtr, and one of dstArray or dstPtr. Also, the documentation recommends initializing the struct to 0 before using it, e.g. cudaMemcpy3DParms myParms = {0};
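Putting the pieces together, a round trip with cudaMalloc3D and cudaMemcpy3D might look roughly like the sketch below. The kernel doubleKernel is a hypothetical stand-in for myKernel (which was not posted), the function name roundTrip3D and the CHECK macro are made up for illustration, and plain malloc is used for the host buffers to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s at line %d\n",                   \
                    cudaGetErrorString(err_), __LINE__);         \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical stand-in for myKernel: dst = 2 * src, element by element.
__global__ void doubleKernel(cudaPitchedPtr src, cudaPitchedPtr dst, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= size || y >= size || z >= size) return;

    // Rows start every `pitch` bytes; slices every `pitch * size` bytes.
    size_t row = (size_t)z * size + y;
    const float *s = (const float *)((const char *)src.ptr + row * src.pitch);
    float *d = (float *)((char *)dst.ptr + row * dst.pitch);
    d[x] = 2.0f * s[x];
}

// Runs the round trip for a size^3 volume; returns the number of wrong elements.
int roundTrip3D(int size)
{
    size_t length = (size_t)size * size * size;
    float *src = (float *)malloc(length * sizeof(float));
    float *dst = (float *)calloc(length, sizeof(float));
    if (!src || !dst) { fprintf(stderr, "host allocation failed\n"); exit(EXIT_FAILURE); }
    for (size_t i = 0; i < length; i++) src[i] = (float)i;

    // No CUDA array is involved, so the extent width is given in bytes.
    cudaExtent extent = make_cudaExtent(size * sizeof(float), size, size);
    cudaPitchedPtr srcGPU, dstGPU;
    CHECK(cudaMalloc3D(&srcGPU, extent));
    CHECK(cudaMalloc3D(&dstGPU, extent));

    // Host buffers are tightly packed, so their pitch is one row of floats.
    cudaMemcpy3DParms toDevice = {0};
    toDevice.srcPtr = make_cudaPitchedPtr(src, size * sizeof(float), size, size);
    toDevice.dstPtr = srcGPU;
    toDevice.extent = extent;
    toDevice.kind = cudaMemcpyHostToDevice;
    CHECK(cudaMemcpy3D(&toDevice));

    dim3 block(8, 8, 8);
    dim3 grid((size + 7) / 8, (size + 7) / 8, (size + 7) / 8);
    doubleKernel<<<grid, block>>>(srcGPU, dstGPU, size);
    CHECK(cudaGetLastError());

    cudaMemcpy3DParms toHost = {0};
    toHost.srcPtr = dstGPU;
    toHost.dstPtr = make_cudaPitchedPtr(dst, size * sizeof(float), size, size);
    toHost.extent = extent;
    toHost.kind = cudaMemcpyDeviceToHost;
    CHECK(cudaMemcpy3D(&toHost));

    int wrong = 0;
    for (size_t i = 0; i < length; i++)
        if (dst[i] != 2.0f * src[i]) wrong++;

    CHECK(cudaFree(srcGPU.ptr));
    CHECK(cudaFree(dstGPU.ptr));
    free(src);
    free(dst);
    return wrong;
}
```

Calling roundTrip3D(256) from main would exercise the same volume size as in the question. The kernel takes the whole cudaPitchedPtr rather than a bare float pointer so that the pitch travels with the data.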

Also, you might be interested in taking a look at this other SO question.

Proper usage of cudaMalloc3D with cudaMemcpy
