GPU computation causes driver error "stopped responding"

Asked: 2014-02-23 17:11:26

Tags: matlab cuda parallel-processing nvidia matlab-gpu

I have a small, meaningless test script that I run in MATLAB R2013b:

clear all;

n = 2000;
times = 50;
i = 0;

tCPU = tic;

disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end

tCPU = toc(tCPU);
tGPU = tic;

disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU =  A * B ; 
end
tGPU = toc(tGPU);

fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);

Unfortunately, after running it I get a message from Windows saying: "Display driver stopped working and has recovered."

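I assume this means that Windows did not get a response from my graphics card driver, or something along those lines. The script itself returned without errors: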

>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec

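But whether or not the GPU is running out of memory, MATLAB cannot use the GPU device again until I restart MATLAB. If I don't restart MATLAB, all I get are these messages from CUDA: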

>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated

Error in test (line 21)
A = gpuArray(A);

Does anyone know how to avoid this problem, or what I am doing wrong here?

In case it is needed, here is my GPU device:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'GeForce GTX 660M'
                     Index: 1
         ComputeCapability: '3.0'
            SupportsDouble: 1
             DriverVersion: 6
            ToolkitVersion: 5
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+09
                FreeMemory: 1.9037e+09
       MultiprocessorCount: 2
              ClockRateKHz: 950000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

1 answer:

Answer 0 (score: 5)

The key piece of information is this part of the gpuDevice output:

KernelExecutionTimeout: 1

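This means that the host display driver is active on the same GPU you are running the compute job on. The NVIDIA display driver contains a watchdog timer that kills any task that runs longer than a predefined time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from making the machine unresponsive by freezing the display. The runtime of your MATLAB script is clearly exceeding the display driver's watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer work with the device. You may be able to reinitialise the context by calling reset, which I guess runs cudaDeviceReset() under the hood.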
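As a minimal sketch of that suggestion, the snippet below reads the watchdog flag from the currently selected device and then resets it. It assumes the Parallel Computing Toolbox calls gpuDevice and reset; whether a reset actually recovers the device after a launch timeout is not confirmed here, so treat it as something to try before restarting MATLAB. The matrix size is only an illustrative value.

```matlab
% Sketch: inspect the watchdog flag, then try to recover the device.
g = gpuDevice;                       % handle to the currently selected GPU
if g.KernelExecutionTimeout
    fprintf('%s is subject to the display watchdog timer.\n', g.Name);
end

% reset clears the device and invalidates every gpuArray that lives on it,
% so any data has to be uploaded again afterwards.
reset(g);
A = gpuArray(rand(2000, 2000));      % illustrative re-upload after the reset
```

Any gpuArray variables created before the reset are no longer usable, so the script has to rebuild them from host data.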

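There is a lot of information about this watchdog timer on the interweb, for example this Stack Overflow question. How to change the timeout depends on your operating system and hardware. The simplest ways to avoid the problem are to not run CUDA code on the display GPU, or to make your compute jobs finer-grained so that no single operation runs longer than the timeout limit. Or just write faster code...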
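The first two options can be sketched roughly as follows. This is an illustration under assumptions, not a verified fix: it assumes that parallel.gpu.GPUDevice.getDevice is available to query a device without selecting it, that synchronising with wait after each piece keeps work from piling up into one long submission, and that a column block of 500 is small enough on this hardware. blockSize is a hypothetical tuning value.

```matlab
n = 2000;
times = 50;
blockSize = 500;                     % hypothetical; tune for your GPU

% Option 1: prefer a supported device that is not subject to the watchdog.
for idx = 1:gpuDeviceCount
    d = parallel.gpu.GPUDevice.getDevice(idx);   % query without selecting
    if d.DeviceSupported && ~d.KernelExecutionTimeout
        gpuDevice(idx);              % select it (this also resets the device)
        break;
    end
end
g = gpuDevice;                       % whichever device is selected now

% Option 2: split each product into column blocks and synchronise per block.
A = gpuArray(rand(n, n));
B = gpuArray(rand(n, n));
C = gpuArray(zeros(n, n));

tGPU = tic;
for k = 1:times
    for c0 = 1:blockSize:n
        cols = c0:min(c0 + blockSize - 1, n);
        C(:, cols) = A * B(:, cols); % one smaller multiplication per block
        wait(g);                     % block until the GPU has finished it
    end
end
tGPU = toc(tGPU);
fprintf('On GPU: %.2f sec\n', tGPU);
```

On a machine with a single GPU, as in the question, the selection loop changes nothing and only the blocked loop applies. Calling wait before toc also means the measured time covers the finished computation rather than only the time needed to queue it.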