This is a journal of my first steps learning CUDA. I've been playing with some hello-world style CUDA code, but it doesn't work and I'm not sure why.
The code is very simple: it takes two integers, adds them on the GPU, and returns the result. But no matter what I change the numbers to, I get the same result. (If maths worked that way I would have done a lot better in the subject than I actually did.)
Here is the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm still setting up my environment, so I just wanted to know whether there is a problem with the code; otherwise it is probably my environment.
Update: I tried adding some code to show the error, but I don't get any error output, although the number changes (is it printing an error code rather than the answer? Even if I do no work in the kernel other than assigning a variable, I still get a similar result.)
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;
    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if (err != cudaSuccess) {
        printf("The error is %s\n", cudaGetErrorString(err));
    }
    add<<<1,1>>>( 1, 4, dev_c );
    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if (err2 != cudaSuccess) {
        printf("The error is %s\n", cudaGetErrorString(err2));
    }
    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The code seems fine, so it is probably something to do with my setup. Installing CUDA on OS X Lion was a nightmare, but I think it works, since the samples in the SDK seem fine. The steps I have taken so far: I went to the NVIDIA website and downloaded the latest Mac versions of the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and PATH=/usr/local/cuda/bin:$PATH. I ran deviceQuery and it passed, reporting the following information about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
Update: what is really strange is that even if I remove all the work from the kernel, I still get a result for c? I have reinstalled CUDA and run make on the samples, and they all pass.
Answer 0 (score: 8)
There are basically two problems here:
In the runtime API, most context-related actions are performed "lazily". The first time you launch a kernel, the runtime API invokes code to intelligently find a suitable CUBIN image inside the fat binary image the toolchain emitted for the target hardware, and loads it into the context. This can also include just-in-time recompilation of PTX for a backwards-compatible architecture, but not the other way around. So if you compiled your kernel for a compute capability 1.2 device and run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture, but the reverse does not work. So in your example, the runtime API generates an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is very cryptic, but you will get an error (see this question for more information).
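(Not part of the original answer, just a sketch:) one quick way to see whether this applies is to print the GPU's compute capability and compare it with the architecture the code was actually built for; for the GeForce 320M in the deviceQuery output above that would be sm_12.

#include <stdio.h>
#include <cuda_runtime.h>

// Sketch: report the compute capability of device 0 so it can be compared
// against the architecture the kernel was compiled for
// (e.g. nvcc -arch=sm_12 for a compute capability 1.2 part like the GeForce 320M).
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}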
The second problem is that the code never checks for errors from the kernel launch itself. If you modify your code to include error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if (err != cudaSuccess) {
    printf("The error is %s\n", cudaGetErrorString(err));
}

add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
    printf("The error is %s\n", cudaGetErrorString(cudaGetLastError()));
}

cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if (err2 != cudaSuccess) {
    printf("The error is %s\n", cudaGetErrorString(err2));
}
the additional error checking after the kernel launch should catch the runtime API error produced by the kernel load/launch failure.
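To avoid repeating the same if/printf block after every call, a common pattern is to wrap each runtime API call in a small checking macro. This is only a sketch (the CUDA_CHECK name is made up, not something from the original answer):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch of a checking macro: print the error string and abort if a
// CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            printf("CUDA error %s at %s:%d\n",                        \
                   cudaGetErrorString(e), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage, applied to the example above:
//   CUDA_CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
//   add<<<1,1>>>(1, 4, dev_c);
//   CUDA_CHECK(cudaPeekAtLastError());
//   CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));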
Answer 1 (score: 1)
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

// Kernel: add the two device integers and store the result in *c
__global__ void Addition(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main()
{
    int a, b, c;
    int *dev_a, *dev_b, *dev_c;
    int size = sizeof(int);

    // Allocate device memory for the two inputs and the result
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = 5, b = 6;

    // Copy the inputs to the device
    cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);

    Addition<<<1, 1>>>(dev_a, dev_b, dev_c);

    // Copy the result back to the host
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    // cudaFree takes the device pointer itself, not the address of the host variable
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    printf("%d\n", c);
    return 0;
}