CUDA texture cache seems to have wrong data?

Asked: 2012-11-24 02:52:30

Tags: image-processing cuda gpu gpgpu nvidia

I've been writing CUDA code for quite a while, but I'm only now getting up to speed on how to use the texture cache.

Using the simpleTexture example from the Nvidia SDK for inspiration, I coded up a small example that uses the texture cache. The host copies the Lena image to the GPU and binds it as a texture. The kernel just copies the contents of the texture cache into an output array.

Strangely, the result (see the all-gray image below the code) doesn't match the input. Any thoughts on what might be going wrong?

Code (look at texCache_dummyKernel):

texture<float, 2, cudaReadModeElementType> tex; //declare texture reference for 2D float texture

//note: tex is global, so no input ptr is needed
__global__ void texCache_dummyKernel(float* out, const int width, const int height){ //copy tex to output
    int x = blockIdx.x*blockDim.x + threadIdx.x; //my index into "big image"
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = y*width+x;

    if(x < width && y < height)
        out[idx] = tex2D(tex, y, x);
}

int main(int argc, char **argv){        
    cv::Mat img = getRawImage("./Lena.pgm");
    img.convertTo(img, CV_32FC1);
    float* hostImg = (float*)&img.data[0];
    int width = img.cols; int height = img.rows;

    dim3 grid;  dim3 block;
    block.x = 16;  block.y = 16;
    grid.x = width/block.x + 1;          
    grid.y = height/block.y + 1;

    cudaArray *dImg; //cudaArray*, not float*
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);        
    CHECK_CUDART(cudaMallocArray(&dImg, &channelDesc, width, height));
    CHECK_CUDART(cudaMemcpyToArray(dImg, 0, 0, hostImg, width*height*sizeof(float), cudaMemcpyHostToDevice));
    setTexCacheParams(); //defined below
    CHECK_CUDART(cudaBindTextureToArray(tex, dImg, channelDesc)); //Bind the array to the texture

    float* dResult; //device memory for output
    CHECK_CUDART(cudaMalloc((void**)&dResult, sizeof(float)*width*height));

    texCache_dummyKernel<<<grid, block>>>(dResult, width, height); //dImg isn't an input param, since 'tex' is a global variable
    CHECK_CUDART(cudaGetLastError()); //make sure kernel didn't crash

    float* hostResult = (float*)malloc(sizeof(float)*width*height);
    CHECK_CUDART(cudaMemcpy(hostResult, dResult, sizeof(float)*width*height, cudaMemcpyDeviceToHost));
    outputProcessedImage(hostResult, width, height, "result.png"); //defined below
}

I should also include the helper functions that I used above:

void setTexCacheParams(){ //configuration directly pulled from simpleTexture in nvidia sdk
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.addressMode[1] = cudaAddressModeWrap;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = true;    // access with normalized texture coordinates
}

void outputProcessedImage(float* processedImg, int width, int height, string out_filename){
    cv::Mat img = cv::Mat::zeros(height, width, CV_32FC1);
    for(int i=0; i<height; i++)
        for(int j=0; j<width; j++)
            img.at<float>(i,j) = processedImg[i*width + j]; //copy each output float into the cv::Mat

    img.convertTo(img, CV_8UC1); //float to uchar
    vector<int> compression_params;
    compression_params.push_back(CV_IMWRITE_PNG_COMPRESSION);
    compression_params.push_back(9);
    cv::imwrite(out_filename, img, compression_params);
}

Input:

(image: the original Lena grayscale image)

Output:

(image: a uniform gray image)


  • Sorry that this post is such a wall of code! I'd appreciate any suggestions on making code like this more concise.
  • I'm using OpenCV for the file I/O above... hopefully that isn't confusing.
  • When I changed the kernel to read the input image from a plain 1D float* array, keeping everything else the same, I got the correct result (see the sketch right after this list).
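
That 1D-array variant looks roughly like this (reconstructed for this post rather than copied, so treat it as a sketch; global_dummyKernel is just a placeholder name):

__global__ void global_dummyKernel(float* out, const float* in, const int width, const int height){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = y*width+x;

    if(x < width && y < height)
        out[idx] = in[idx]; //plain global-memory load: no texture addressing, so no coordinate pitfalls
}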

1 answer:

Answer 0 (score: 3):

In your original code, you have initialized the texture to use normalized coordinates. This means the texture is addressed on [0,1] in each spatial dimension. So your kernel should look like this:

__global__ 
void texCache_dummyKernel(float* out, const int width, const int height)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x; //my index into "big image"
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = y*width+x;

    if(x < width && y < height) {
        float u = float(x)/float(width), v = float(y)/float(height);
        out[idx] = tex2D(tex, u, v);
    }
}

[Standard disclaimer: written in browser, never compiled or tested, use at your own risk]

That is, you should pass coordinates to tex2D that are normalized by the width and height of the image.

Alternatively, as you have found, you can change the texture definition to normalized=false and use addressing in absolute rather than relative texture coordinates. Even then, the texture read in your code should look like this:

out[idx] = tex2D(tex, float(x)+0.5f, float(y)+0.5f);

Texture addressing is always done with floating-point coordinates, and the texture data is voxel-centred, so 0.5 is added to each coordinate to ensure that the read comes from the centroid of each interpolation area or volume within the texture.
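
To make the 0.5 offset concrete: with cudaFilterModeLinear, a 2D fetch is bilinearly interpolated along the lines of the scheme in the programming guide's texture-fetching appendix:

tex(x,y) = (1-a)*(1-b)*T[i,j] + a*(1-b)*T[i+1,j] + (1-a)*b*T[i,j+1] + a*b*T[i+1,j+1]

where x_B = x - 0.5, i = floor(x_B), a = frac(x_B), and similarly for y_B, j and b. Passing float(x)+0.5f makes x_B = x exactly, so a = b = 0 and the fetch returns the unfiltered texel T[x,y].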

You can find a description of texture filtering and addressing modes, and their effect on interpolation, in one of the appendices of the CUDA C Programming Guide.
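
For completeness, the setup function for the unnormalized variant would look something like this (a sketch, untested; note that cudaAddressModeWrap is only supported with normalized coordinates, so the address mode has to change too, e.g. to clamp):

void setTexCacheParams(){ //unnormalized variant
    tex.addressMode[0] = cudaAddressModeClamp; //Wrap requires normalized coordinates
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = false;   // access with absolute texel coordinates in [0,width) x [0,height)
}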