Due to the nature of the algorithm I am programming, I need to write/fill a 3D matrix with some specific maths and then read from that matrix (in a separate kernel) as a 3D linearly interpolated texture.

Since the texture is a reading mode, I am assuming that I can somehow write into global memory that is bound to the texture, and read it in a separate kernel, without needing double the memory and copying the values from a write matrix to a read matrix. However, I can't figure out how to do this.

My problem is that I don't know how to define this global read/write array. In the sample below I create a 3D texture, but the code uses cudaExtent and cudaArray. I can't seem to write into those types from a kernel, and I can't seem to create them with float* or the like.

I may not be able to do this at all and will need a memcpy somewhere in the middle (a rough sketch of that copy-based fallback is included at the end of this question), but as these arrays are generally big, I'd like to save that memory.

Sample code (it does not compile, but it clearly shows the structure of what I am trying to do). It uses 100x100x100 3D memory by default, just because.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda_runtime_api.h>
#include <cuda.h>
#define MAXTREADS 1024
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
texture<float, cudaTextureType3D, cudaReadModeElementType> tex;
__global__ void readKernel(float* imageend )
{
int indY = blockIdx.y * blockDim.y + threadIdx.y;
int indX = blockIdx.x * blockDim.x + threadIdx.x;
int indZ = blockIdx.z * blockDim.z + threadIdx.z;
//Make sure we dont go out of bounds
size_t idx = indZ * 100 * 100 + indY * 100 + indX;
if (indX >= 100 | indY >= 100 | indZ >= 100)
return;
imageend[idx] = tex3D(tex, indX + 0.5, indY + 0.5, indZ + 0.5);
}
__global__ void writeKernel(float* imageaux){
int indY = blockIdx.y * blockDim.y + threadIdx.y;
int indX = blockIdx.x * blockDim.x + threadIdx.x;
int indZ = blockIdx.z * blockDim.z + threadIdx.z;
//Make sure we dont go out of bounds
size_t idx = indZ * 100 * 100 + indY * 100 + indX;
if (indX >= 100 | indY >= 100 | indZ >= 100)
return;
imageaux[idx] = (float)idx;
}
int main()
{
cudaArray *d_image_aux= 0;
const cudaExtent extent = make_cudaExtent(100, 100, 100);
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMalloc3DArray(&d_image_aux, &channelDesc, extent);
// Configure texture options
tex.normalized = false;
tex.filterMode = cudaFilterModeLinear;
tex.addressMode[0] = cudaAddressModeBorder;
tex.addressMode[1] = cudaAddressModeBorder;
tex.addressMode[2] = cudaAddressModeBorder;
cudaBindTextureToArray(tex, d_image_aux, channelDesc);
float *d_image_end = 0;
size_t num_bytes = 100 * 100 * 100 * sizeof(float);
cudaMalloc((void**)&d_image_end, num_bytes);
cudaMemset(d_image_end, 0, num_bytes);
int divx, divy, divz; //Irrelevant for the demo, important for the main code
divx = 32;
divy = 32;
divz = 1;
dim3 grid((100 + divx - 1) / divx,
(100 + divy - 1) / divy,
(100 + divz - 1) / divz);
dim3 block(divx, divy, divz);
// Kernels
writeKernel << <grid, block >> >(d_image_aux);
readKernel << <grid, block >> >(d_image_end);
cudaUnbindTexture(tex);
cudaFree(d_image_aux);
cudaFree(d_image_end);
return 0;
}
NOTE: I am aware that I cannot write "interpolated" values or anything like that; the write operation will always be at integer indexes, while the read operation needs to use trilinear interpolation.
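For reference, the copy-based fallback mentioned above (filling a linear buffer with writeKernel and then copying it into the cudaArray the texture is bound to) would look roughly like the sketch below. It is only an illustration reusing d_image_end, d_image_aux and extent from the sample, and it is exactly the extra copy and duplicated memory I would like to avoid:

// Sketch of the copy-based fallback (the thing I want to avoid): after
// writeKernel has filled the linear buffer, copy it into the cudaArray
// that the texture is bound to, then launch readKernel as before.
void copyLinearToArray(float *d_image_end, cudaArray *d_image_aux,
                       cudaExtent extent)
{
    cudaMemcpy3DParms copyParams = {0};
    // pitch is in bytes; width and height are in elements
    copyParams.srcPtr   = make_cudaPitchedPtr(d_image_end,
                                              extent.width * sizeof(float),
                                              extent.width, extent.height);
    copyParams.dstArray = d_image_aux;
    copyParams.extent   = extent;
    copyParams.kind     = cudaMemcpyDeviceToDevice;
    cudaMemcpy3D(&copyParams);
}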
Answer 0 (score: 3)
I believe all of the necessary pieces to demonstrate a kernel writing to a 3D surface (bound to an underlying 3D cudaArray), followed by another kernel texturing (i.e. reading with automatic interpolation) from the same data (a 3D texture bound to the same underlying 3D cudaArray), are contained in the volumeFiltering CUDA sample code.

The only conceptual difference is that the sample code has two different underlying 3D cudaArrays (one for the texture, one for the surface), but we can combine them, so that the data written to the surface can subsequently be read during texturing operations.
Here is a fully worked example:
$ cat texsurf.cu
#include <stdio.h>
#include <helper_cuda.h>

texture<float, cudaTextureType3D, cudaReadModeElementType> volumeTexIn;
surface<void, 3> volumeTexOut;

__global__ void
surf_write(float *data, cudaExtent volumeSize)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = blockIdx.z*blockDim.z + threadIdx.z;
    if (x >= volumeSize.width || y >= volumeSize.height || z >= volumeSize.depth)
    {
        return;
    }
    float output = data[z*(volumeSize.width*volumeSize.height)+y*(volumeSize.width)+x];
    // surface writes need byte offsets for x!
    surf3Dwrite(output,volumeTexOut,x * sizeof(float),y,z);
}

__global__ void
tex_read(float x, float y, float z){
    printf("x: %f, y: %f, z:%f, val: %f\n", x,y,z,tex3D(volumeTexIn,x,y,z));
}

void runtest(float *data, cudaExtent vol, float x, float y, float z)
{
    // create 3D array
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray_t content;
    checkCudaErrors(cudaMalloc3DArray(&content, &channelDesc, vol, cudaArraySurfaceLoadStore));
    // copy data to device
    float *d_data;
    checkCudaErrors(cudaMalloc(&d_data, vol.width*vol.height*vol.depth*sizeof(float)));
    checkCudaErrors(cudaMemcpy(d_data, data, vol.width*vol.height*vol.depth*sizeof(float), cudaMemcpyHostToDevice));

    dim3 blockSize(8,8,8);
    dim3 gridSize((vol.width+7)/8,(vol.height+7)/8,(vol.depth+7)/8);
    volumeTexIn.filterMode = cudaFilterModeLinear;
    checkCudaErrors(cudaBindSurfaceToArray(volumeTexOut, content));
    surf_write<<<gridSize, blockSize>>>(d_data, vol);
    // bind array to 3D texture
    checkCudaErrors(cudaBindTextureToArray(volumeTexIn, content));
    tex_read<<<1,1>>>(x, y, z);
    checkCudaErrors(cudaDeviceSynchronize());
    cudaFreeArray(content);
    cudaFree(d_data);
    return;
}

int main(){
    const int dim = 8;
    float *data = (float *)malloc(dim*dim*dim*sizeof(float));
    for (int z = 0; z < dim; z++)
        for (int y = 0; y < dim; y++)
            for (int x = 0; x < dim; x++)
                data[z*dim*dim+y*dim+x] = z*100+y*10+x;

    cudaExtent vol = {dim,dim,dim};
    runtest(data, vol, 1.5, 1.5, 1.5);
    runtest(data, vol, 1.6, 1.6, 1.6);
    return 0;
}
$ nvcc -I/usr/local/cuda/samples/common/inc texsurf.cu -o texsurf
$ cuda-memcheck ./texsurf
========= CUDA-MEMCHECK
x: 1.500000, y: 1.500000, z:1.500000, val: 111.000000
x: 1.600000, y: 1.600000, z:1.600000, val: 122.234375
========= ERROR SUMMARY: 0 errors
$
I'm not going to try to give a complete tutorial on linear texture filtering here. There are plenty of other questions covering the details of indexing and filtering, and it doesn't seem to be the crux of this question. I've chosen the points (1.5, 1.5, 1.5) and (1.6, 1.6, 1.6) to allow easy verification of the underlying data; the results make sense to me. (With unnormalized coordinates and linear filtering, the texel at integer index i is centered at coordinate i + 0.5, so (1.5, 1.5, 1.5) lands exactly on the center of the element at x=1, y=1, z=1, which was initialized to 1*100 + 1*10 + 1 = 111.)
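For newer CUDA versions, where the texture reference API used above is deprecated, the same write-via-surface / read-via-texture pattern can be expressed with texture and surface objects. The following is only a rough sketch of that variant, not a drop-in replacement: the helper names surf_write_obj, tex_read_obj and make_objects are made up for illustration, and "content" is assumed to be a cudaArray allocated with the cudaArraySurfaceLoadStore flag as in the example above.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Write one element per thread into the 3D surface backed by the shared cudaArray.
__global__ void surf_write_obj(cudaSurfaceObject_t surf, const float *data, cudaExtent vol)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = blockIdx.z*blockDim.z + threadIdx.z;
    if (x >= vol.width || y >= vol.height || z >= vol.depth) return;
    float v = data[z*vol.width*vol.height + y*vol.width + x];
    surf3Dwrite(v, surf, x * sizeof(float), y, z);   // x offset is in bytes
}

// Read back through the texture object, with hardware trilinear interpolation.
__global__ void tex_read_obj(cudaTextureObject_t tex, float x, float y, float z)
{
    printf("val: %f\n", tex3D<float>(tex, x, y, z));
}

// Create a surface object (for writing) and a texture object (for filtered
// reading) that both view the same underlying cudaArray.
void make_objects(cudaArray_t content,
                  cudaSurfaceObject_t *surfObj, cudaTextureObject_t *texObj)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = content;

    cudaCreateSurfaceObject(surfObj, &resDesc);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.filterMode       = cudaFilterModeLinear;   // trilinear reads
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;                      // unnormalized coords
    cudaCreateTextureObject(texObj, &resDesc, &texDesc, NULL);
}

The objects are then passed to the kernels as ordinary arguments, e.g. surf_write_obj<<<gridSize, blockSize>>>(surfObj, d_data, vol); followed by tex_read_obj<<<1,1>>>(texObj, 1.5f, 1.5f, 1.5f);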