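This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, Xcode 4.x and CUDA 5.0, CUDA code compiled and ran.
Then I updated to OS X 10.9.2, Xcode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver that ships with CUDA 5.5) does not support compute capability 1.x (sm_10), but that 5.5.43 does. After updating the CUDA driver to the even newer, current 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery does pass, with the following output.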
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
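Furthermore, I can compile the CUDA 5.5 samples successfully without modification, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT and simpleCUBLAS all fail immediately when run.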
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error code 2 is cudaErrorMemoryAllocation, but I suspect it is somehow masking a failed CUDA initialization.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
The actual error code is CUBLAS_STATUS_NOT_INITIALIZED, returned from the call to cublasCreate().
Has anyone run into this problem before and found a fix? Thanks in advance.
Answer 0 (score: 2)
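My guess is that you are running out of memory. Your GPU is being used by the display manager, and it only has 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this: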
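#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    size_t mfree = 0, mtotal = 0;

    cudaSetDevice(0);
    // cudaMemGetInfo needs a device context, so a failure here is informative.
    cudaError_t err = cudaMemGetInfo(&mfree, &mtotal);
    if (err != cudaSuccess) {
        std::cerr << "cudaMemGetInfo failed: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;
    return cudaDeviceReset();
}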
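[Disclaimer: written in the browser, never compiled or tested, use at your own risk]
This should give you a picture of how much free memory is available after the device establishes a context. You might be surprised at how little there is to work with.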
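EDIT: Here is an even lighter-weight alternative test that does not even attempt to establish a context on the device. Instead, it only uses the driver API to query the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or there is no memory available on the device to establish a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA: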
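#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b = 0;

    // Driver API only: no context is created on the device.
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&d, 0) != CUDA_SUCCESS ||
        cuDeviceTotalMem(&b, d) != CUDA_SUCCESS) {
        std::cerr << "Driver API query failed" << std::endl;
        return 1;
    }
    std::cout << "Total memory = " << b << std::endl;
    return 0;
}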
Note that you will need to explicitly link the CUDA driver library to get this to work (for example, by passing -lcuda to nvcc).