Question

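I am trying to explore the __ldg intrinsic. I have gone through NVIDIA's documentation but did not find a satisfactory answer about its use and implementation. Following THIS, I tried using __ldg in a simple 1024 * 1024 matrix multiplication example.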

#include<stdio.h>
#include<stdlib.h>

__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
        float pvalue=0;
        //find Row and Column corresponding to a data element for each thread
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        //calculate dot product of Row of First Matrix and Column of Second Matrix
        for(int i=0;i< N;++i)
        {
//   I tried with executing this first:
            float m=__ldg(&ad[Row * N+i]);
            float n=__ldg(&bd[i * N + Col]);

//Then I executed this as a normal execution:
//          float m = ad[Row * N+i];
//          float n = bd[i * N + Col];

            pvalue += m * n;
         }
        //store dot product at corresponding position in resultant Matrix
        cd[Row * N + Col] = pvalue;
}

int main()
{
    int N = 1024,i,j;               //N == size of square matrix

    float *a,*b;
    float *ad,*bd,*cd,*c;

    //open a file for outputting the result
    FILE *f;
    f=fopen("Parallel Multiply_ldg.txt","w");

    size_t size=sizeof(float)* N * N;

    //allocate host side memory
    a=(float*)malloc(size);
    b=(float*)malloc(size);
    c=(float*)malloc(size);

    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i*N+j]=2.0;   //(float)(i*N+j);       //initializing each value with its own index
            b[i*N+j]=1.0;   //(float)(i*N+j);       //random functions can be used alternatively
        }
    }

    //allocate device memory
    cudaMalloc(&ad,size);
    //printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&bd,size);
    //printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&cd,size);
    //printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //copy value from host to device
    cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);

    printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //calculate execution configuration
    dim3 blocksize(16,16);              //each block contains 16 * 16 (=256) threads
    dim3 gridsize(N/16,N/16);           //creating just sufficient no of blocks

    //GPU timer code
    float time;
    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);

    matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
    cudaDeviceSynchronize();
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);         //time taken in kernel call calculated
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    //copy back results
    cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);

    printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //output results in output_file
    fprintf(f,"Array A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",a[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nArray B was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",b[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nMultiplication of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",c[i*N+j]);              //if correctly computed, every value is 2*N (a is all 2.0, b is all 1.0)
        fprintf(f,"\n");
    }
    printf("\nYou can see output in Parallel Multiply_ldg.txt file in project directory");
    printf("\n\nTime taken is %f (ms)\n",time);
    fprintf(f,"\n\nTime taken is %f (ms)\n",time);
    fclose(f);

    //release device and host memory
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    free(a);free(b);free(c);
    cudaDeviceReset();              //cudaThreadExit() is deprecated
    return 0;
}

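I commented out the __ldg part of my kernel and ran the normal loads instead, and vice versa. In both cases it gives the correct multiplication result. I am confused by the time difference between the two runs, because it is huge: more than 100 times!

With __ldg it gives me: Time taken is 0.014432 (ms)

With normal execution, without __ldg, it gives me: Time taken is 36.858398 (ms)

Is this the right way to use the __ldg intrinsic? What is the significance of the __ldg intrinsic, and what is the proper way to use it? Apparently what I did in the code above is wrong and naive. I am looking for an explanation and an example. Thanks in advance.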

Answer 1

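From the CUDA C Programming Guide:

Global memory accesses for devices of compute capability 3.x are cached in L2 and, for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.

...

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.

Read-only cache accesses have a much lower latency than global memory accesses. Because matrix multiplication reads the same values from memory many times, caching them in the read-only cache gives a huge speedup (in memory-bound applications).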
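Because the quoted passage ties the read-only data cache to compute capability 3.5, code that must also build for older targets can guard the intrinsic. The following is a minimal sketch under that assumption; `load_readonly` and `scale` are hypothetical names invented for illustration, not part of CUDA.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper, not a CUDA API: use __ldg() where the read-only data
// cache exists (compute capability 3.5+), otherwise do a plain load.
__device__ __forceinline__ float load_readonly(const float* ptr)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    return __ldg(ptr);
#else
    return *ptr;
#endif
}

// Multiplies every element of in by factor; in is never written by the kernel.
__global__ void scale(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * load_readonly(&in[i]);
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 3.0f);

    // Check the launch and the execution status before trusting any timing:
    // a kernel that failed to launch returns almost immediately.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return 0;
}
```

The guard keeps a single source file usable across architectures; whether the read-only path is actually faster depends on the access pattern and the GPU.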

Answer 2

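NVIDIA GPUs have texture units: hardware with specialized, relatively simple logic for working with images.

Texture memory is another type of memory access available on the GPU. In particular, it is reached through its own path and cache, separate from constant memory, ordinary global memory loads, and the register file.

Kepler GPUs and later add the ability to use this cache for global memory reads through the "GPU texture pipeline".

But first, let us specify the difference between the constant cache and the read-only cache.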

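Constant cache

Data loaded through the constant cache must be relatively small, and must be accessed in such a pattern that all threads of a warp access the same location at any given time.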
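The uniform access pattern described above can be sketched as follows. This is an illustrative example, not code from the question: `coeffs`, `eval_poly`, and `DEGREE` are made-up names, and the polynomial is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define DEGREE 3

// Small lookup table placed in constant memory (illustrative name).
__constant__ float coeffs[DEGREE + 1];

// Evaluates the same polynomial at a different x for every thread.
__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Horner's rule: in each iteration all threads of a warp read the same
    // coeffs[k], which is the uniform pattern the constant cache is built for.
    for (int k = DEGREE; k >= 0; --k)
        acc = acc * x[i] + coeffs[k];
    y[i] = acc;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    const float h_coeffs[DEGREE + 1] = {1.0f, 2.0f, 3.0f, 4.0f}; // 1 + 2x + 3x^2 + 4x^3

    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;   // with x = 1 the result is 1 + 2 + 3 + 4

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    eval_poly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaMemcpy(h, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h[0]);

    cudaFree(d_x);
    cudaFree(d_y);
    free(h);
    return 0;
}
```

Per-thread data such as `x[i]` stays in global memory; only the small table that every thread indexes identically goes into `__constant__` storage.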

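Read-only cache or texture memory cache

This cache can be much larger and can be accessed in a non-uniform pattern. The read-only cache has a granularity of 32 bytes.

You can use it as a "read-only cache" for your CUDA kernel.

1. Data stored in global memory can be cached in the GPU texture cache.
2. By doing that, you promise the compiler that the data is read-only for the
   duration of the kernel's execution on the GPU.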

There are two ways to achieve this.

A. Using the intrinsic function __ldg

Example: output[i] += __ldg(&input[j]);
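A complete program built around the one-line example in option A might look like the sketch below. The gather pattern, the `gather_add` kernel, and the `index` array are assumptions chosen to show a non-uniform read; the program needs a target of compute capability 3.5 or later (for example `nvcc -arch=sm_35` on toolkits that still support it).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// output[i] += input[index[i]], with both read-only arrays loaded via __ldg().
// Neither input nor index is written while the kernel runs.
__global__ void gather_add(const float* input, const int* index, float* output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = __ldg(&index[i]);          // non-uniform location per thread
        output[i] += __ldg(&input[j]);
    }
}

int main()
{
    const int n = 1 << 16;
    float* h_in  = (float*)malloc(n * sizeof(float));
    int*   h_idx = (int*)malloc(n * sizeof(int));
    float* h_out = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        h_in[i]  = (float)i;
        h_idx[i] = (i * 7) % n;            // arbitrary scattered indices
    }

    float *d_in, *d_out;
    int* d_idx;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in,  h_in,  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx, n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));   // all-zero bytes are 0.0f

    gather_add<<<(n + 255) / 256, 256>>>(d_in, d_idx, d_out, n);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[h_idx[i]]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in); cudaFree(d_idx); cudaFree(d_out);
    free(h_in); free(h_idx); free(h_out);
    return 0;
}
```

Note that `output` is read and written, so it must not go through `__ldg`; only data that no thread modifies during the kernel qualifies.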

B. Qualifying the pointers to global memory

const float* __restrict__ input
output[idx] += input[idx];
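Applied to the matrix multiplication from the question, option B could look like the following sketch. It reuses the question's names (`matrix_mul`, `ad`, `bd`, `cd`) but shrinks the matrix and adds a bounds check; those changes are assumptions made to keep the example short. Whether the compiler actually emits a read-only load can be checked by looking for `ld.global.nc` in the output of `nvcc -ptx`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// const + __restrict__ tell the compiler that ad and bd are read-only and do
// not alias cd, so it may route their loads through the read-only data cache.
__global__ void matrix_mul(const float* __restrict__ ad,
                           const float* __restrict__ bd,
                           float* __restrict__ cd, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row >= N || Col >= N) return;

    float pvalue = 0.0f;
    for (int i = 0; i < N; ++i)
        pvalue += ad[Row * N + i] * bd[i * N + Col];
    cd[Row * N + Col] = pvalue;
}

int main()
{
    const int N = 256;                         // smaller than the question's 1024
    const size_t size = sizeof(float) * N * N;

    float* a = (float*)malloc(size);
    float* b = (float*)malloc(size);
    float* c = (float*)malloc(size);
    for (int i = 0; i < N * N; ++i) { a[i] = 2.0f; b[i] = 1.0f; }

    float *ad, *bd, *cd;
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    dim3 blocksize(16, 16);
    dim3 gridsize((N + 15) / 16, (N + 15) / 16);
    matrix_mul<<<gridsize, blocksize>>>(ad, bd, cd, N);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    int wrong = 0;
    for (int i = 0; i < N * N; ++i)
        if (c[i] != 2.0f * N) ++wrong;         // every entry should equal 2*N
    printf("wrong entries: %d\n", wrong);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    free(a); free(b); free(c);
    return 0;
}
```

The qualifiers are a promise, not a check: if `cd` overlapped `ad` or `bd`, the behavior would be undefined.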

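Comparison:

The __ldg intrinsic is the better choice. It requests the read-only load explicitly, whereas with const and __restrict__ the compiler uses the read-only cache only when it can detect that the read-only condition holds, which, as the Programming Guide passage quoted in Answer 1 notes, it cannot always do.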
