Question

I'm starting my journey to learn CUDA. I'm playing with some hello-world-style CUDA code, but it isn't working and I'm not sure why.

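The code is very simple: it takes two ints, adds them on the GPU, and returns the result. But no matter what I change the numbers to, I get the same result. (If math worked that way, I would have done a lot better in the subject than I actually did.)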

Here's the sample code:

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

The output seems a bit off: 1 + 4 = -1065287167

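I'm still setting up my environment and just wanted to know whether there is a problem with the code; otherwise it's probably my environment.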

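Update: I tried adding some code to show the error. I get no error output, but the number changes. (Is it printing error codes instead of answers? Even if I do no work in the kernel other than assigning a variable, I still get similar results.)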

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if(err != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err));
    }
    add<<<1,1>>>( 1, 4, dev_c );

    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if(err2 != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err2));
    }


    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

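The code appears to be fine, so maybe it's related to my setup. Getting CUDA installed on OS X Lion has been a nightmare, but I thought it worked, since the examples in the SDK seemed to be fine. The steps I have taken so far: I went to the NVIDIA website and downloaded the latest Mac releases of the driver, toolkit, and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and PATH=/usr/local/cuda/bin:$PATH. I ran deviceQuery and it passed, with the following information about my system: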

[deviceQuery] starting...

/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          4.2 / 4.2
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors x (  8) CUDA Cores/MP:    48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   No
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED

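Update: What's really weird is that even if I remove all the work in the kernel, I still get a result for c. I have reinstalled CUDA and run make on the examples, and all of them pass.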

Answer 1

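Basically, there are two problems here:

1. You are not compiling the kernel for the correct architecture (gleaned from the comments).
2. Your code contains incomplete error checking, which misses the point at which the runtime error occurs and leads to mysterious, unexplained symptoms.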

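In the runtime API, most context-related actions are performed "lazily". When you launch a kernel for the first time, the runtime API invokes code that looks inside the fat binary image emitted by the toolchain for a CUBIN image suitable for the target hardware and loads it into the context. This can also include JIT recompilation of PTX for a backwards-compatible architecture, but not the other way around. So if you have a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT-compile the PTX 1.x code it contains for the newer architecture. The reverse doesn't work. So in your example, the runtime API generates an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for more details).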
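To pick the right architecture, it helps to ask the device what it is. The following is a minimal sketch, assuming device 0 is the GPU of interest; `computeCapability` is a hypothetical helper name, not part of the CUDA API. It reads the compute capability with `cudaGetDeviceProperties` and prints the matching `-arch=sm_XY` value for nvcc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns major * 10 + minor for the given device
// (e.g. 12 for compute capability 1.2), or -1 if the query fails.
int computeCapability(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.major * 10 + prop.minor;
}

int main() {
    // Assumption: device 0 is the GPU the kernel will run on.
    int cc = computeCapability(0);
    if (cc < 0) {
        printf("Could not query device 0\n");
        return 1;
    }
    // The value after "sm_" is what nvcc expects in its -arch option.
    printf("Device 0 has compute capability %d.%d; compile with -arch=sm_%d\n",
           cc / 10, cc % 10, cc);
    return 0;
}
```

For the GeForce 320M in the question's deviceQuery output (compute capability 1.2), this would suggest `nvcc -arch=sm_12` with a toolkit of that era.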

If your code contained error checking like this:

cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
    printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}

cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err2));
}

then the extra error check after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
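The same checks can be applied to every runtime call without repeating the if-blocks. This is only a sketch under stated assumptions: `CUDA_CHECK` and `addOnDevice` are hypothetical names introduced for illustration, and aborting on the first error is a design choice, not a requirement. It also adds a `cudaDeviceSynchronize()` after the launch so that errors raised while the kernel runs are reported at that point rather than by a later call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report the failing call's location and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t status_ = (call);                                 \
        if (status_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(status_));           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

// Adds a and b on the GPU, checking every runtime API call on the way.
int addOnDevice(int a, int b) {
    int c = 0;
    int *dev_c = NULL;

    CUDA_CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));

    add<<<1, 1>>>(a, b, dev_c);
    CUDA_CHECK(cudaGetLastError());       // launch failures, e.g. no usable binary image
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel executed

    CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_c));
    return c;
}

int main() {
    printf("1 + 4 = %d\n", addOnDevice(1, 4));
    return 0;
}
```

With a binary built for the wrong architecture, the check right after the launch is the one expected to fire, instead of an uninitialized value of c being printed.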

Answer 2

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

// Adds the two ints pointed to by a and b and stores the sum in *c.
__global__ void Addition(int *a, int *b, int *c)
{
   *c = *a + *b;
}

int main()
{
  int a, b, c;
  int *dev_a, *dev_b, *dev_c;
  int size = sizeof(int);

  // Allocate device memory for both inputs and the result.
  cudaMalloc((void**)&dev_a, size);
  cudaMalloc((void**)&dev_b, size);
  cudaMalloc((void**)&dev_c, size);

  a = 5, b = 6;

  // Copy the inputs to the device.
  cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);

  Addition<<< 1,1 >>>(dev_a, dev_b, dev_c);
  cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

  // cudaFree takes the device pointer itself, not its address.
  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);

  printf("%d\n", c);
  return 0;
}

Simple addition of two ints in CUDA, result is always the same
