Question

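I'm seeing a strange problem: my application crashes inside nvcuda.dll after running for about 2 hours. After spending a lot of time trying to debug it, I think I know what is going on, but I'd like to know whether anyone else has seen this.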

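My application launches most of its kernels in a non-default stream, and this can go on for hours before the default stream is needed. Everything worked fine until I upgraded the driver from a 320-series version to the recent 332.50 (K40m). Now, if the application runs for about 2 hours and then makes any call that uses the default stream, it crashes inside nvcuda.dll during that call. At first I thought something was wrong with my kernels, but it happens even with something as basic as cudaMemcpy (which uses the default stream). When the application runs for a shorter time, say 1 or 1.5 hours, no crash occurs. It took me a while to realize the driver might be the problem, so I uninstalled the new driver and installed the old one (320.92), and the problem went away! I repeated the same procedure (change the driver, reboot, run the application again) several times, and it reproduces 100% of the time.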

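Unfortunately I don't have a small, self-contained repro, but before I try to create one, has anyone seen something similar recently? The Event Viewer entry from the time of the crash doesn't say much: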

Faulting application name: <app>.exe, version: <version>, time stamp: 0x5316a970
Faulting module name: nvcuda.dll, version: 8.17.13.3250, time stamp: 0x52e1fa40
Exception code: 0xc00000fd
Fault offset: 0x00000000002226e7
Faulting process id: 0x1558
Faulting application start time: 0x01cf3831a2f3b71b
Faulting application path: <app>.exe
Faulting module path: C:\windows\SYSTEM32\nvcuda.dll
Report Id: aceb9a51-a433-11e3-9403-90b11c4725be
Faulting package full name: 
Faulting package-relative application ID:

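Update 1: I now have a simple application that reproduces the crash on both K20m and K40m cards.

Update 2: I updated the sample application so that it reproduces the crash. From the call stack, it looks like a stack overflow inside nvcuda.dll.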
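The exception code in the event log entries, 0xc00000fd, is the Windows NTSTATUS value STATUS_STACK_OVERFLOW, which is consistent with the stack overflow seen in the call stack. As a minimal sketch for reading such codes, the helper below maps a few well-known values to their symbolic names. The function name describeExceptionCode is hypothetical and the table is deliberately incomplete.

```cpp
#include <cstdint>
#include <string>

// Map a few well-known NTSTATUS exception codes to their symbolic names.
// Hypothetical helper for illustration; codes not listed return "unknown".
std::string describeExceptionCode(std::uint32_t code)
{
    switch (code)
    {
    case 0xC0000005u: return "STATUS_ACCESS_VIOLATION";
    case 0xC000001Du: return "STATUS_ILLEGAL_INSTRUCTION";
    case 0xC0000094u: return "STATUS_INTEGER_DIVIDE_BY_ZERO";
    case 0xC00000FDu: return "STATUS_STACK_OVERFLOW";
    default:          return "unknown";
    }
}
```

Calling describeExceptionCode(0xC00000FDu) with the code from the log yields the stack overflow name, which points at deep recursion or very large stack usage rather than an invalid pointer access.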

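Steps:

1. Install the latest driver (332.50) on the machine.
2. Create a new CUDA 5.5 project in Visual Studio 2012.
3. Replace the contents of kernel.cu with the code below.
4. Compile and run the code on a machine with a K20m or K40m.
5. After about 2 hours of execution the application crashes, and the entry below is written to the event log.
6. Uninstall the driver and install an earlier version (for example 321.10).
7. Run the application; it should keep running after 2, 3, or more hours.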

Log:

Faulting application name: CudaTests60.exe, version: 0.0.0.0, time stamp: 0x5317974f
Faulting module name: nvcuda.dll, version: 8.17.13.3250, time stamp: 0x52e1fa40
Exception code: 0xc00000fd
Fault offset: 0x000000000004f5cb
Faulting process id: 0x23d0
Faulting application start time: 0x01cf38ba16961e74
Faulting application path: d:\bin\test\CudaTests60.exe
Faulting module path: C:\windows\system32\nvcuda.dll
Report Id: 192506c4-a4be-11e3-9401-90b11c4b02c0
Faulting package full name: 
Faulting package-relative application ID:

Code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <assert.h>
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <Windows.h>

int main()
{
    cudaError_t cudaStatus;
    {
        int crow = 10000;
        int ccol = 10000;
        int cshared = 10000;
        int xLength = crow * cshared;
        int yLength = cshared * ccol;
        int matLength = crow * ccol;

        thrust::device_vector<float> x(xLength);
        thrust::device_vector<float> y(yLength);
        thrust::device_vector<float> mat(matLength);

        thrust::fill(x.begin(), x.end(), 1.0f);
        thrust::fill(y.begin(), y.end(), 1.0f);
        thrust::fill(mat.begin(), mat.end(), .0f);

        cudaStream_t ops;
        cudaStatus = cudaStreamCreate(&ops);
        assert(0 == cudaStatus);

        cublasHandle_t cbh;
        cublasStatus_t cbstatus;
        cbstatus = cublasCreate(&cbh);
        assert(0 == cbstatus);

        cbstatus = cublasSetStream(cbh, ops);
        assert(0 == cbstatus);

        float alpha = 1;
        float beta = 0;
        float* px = thrust::raw_pointer_cast(x.data());
        float* py = thrust::raw_pointer_cast(y.data());
        float* pmat = thrust::raw_pointer_cast(mat.data());
        ULONGLONG start = GetTickCount64();
        ULONGLONG iter = 0;
        while (true)
        {
            cbstatus = cublasSgemm(cbh, CUBLAS_OP_N, CUBLAS_OP_N, crow, ccol, cshared, &alpha, px, crow, py, cshared, &beta, pmat, crow);
            assert(0 == cbstatus);
            if (0 != cbstatus)
            {
                printf("cublasSgemm failed: %d.\n", cbstatus);
                break;
            }
            cudaStatus = cudaStreamSynchronize(ops);
            assert(0 == cudaStatus);
            if (0 != cudaStatus)
            {
                printf("cudaStreamSynchronize failed: %d.\n", cudaStatus);
                break;
            }

            ULONGLONG cur = GetTickCount64();
            // Exit after 2 hours.
            if (cur - start > 2 * 3600 * 1000)
                break;
            iter++;
        }

        // Crash will happen here.
        printf("Before cudaMemcpy.\n");
        float res = 0;
        cudaStatus = cudaMemcpy(&res, px, sizeof(float), cudaMemcpyDeviceToHost);
        assert(0 == cudaStatus);
        if (0 == cudaStatus)
            printf("After cudaMemcpy: %f\n", res);
        else
            printf("cudaMemcpy failed: %d\n", cudaStatus);
    }

    return 0;
}

Answer 1

I'm not surprised that the program crashes at the location you indicated. This line of code is illegal:

cudaStatus = cudaMemcpy(pmat, px, x.size() * sizeof(float), cudaMemcpyDeviceToHost);

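Both pmat and px are pointers to device memory. However, you requested cudaMemcpyDeviceToHost, which means the pmat pointer is interpreted as a host pointer and gets dereferenced during the copy. Dereferencing a device pointer in host code is illegal and results in a segmentation fault.

With appropriate modifications, I ran your code on Linux, and it showed a segmentation fault on that line.

Note that I'm not disputing your claim that there may be a problem with the driver (there could well be a bug!), but I don't think this code reproduces anything related to a driver bug.

Bugs can be filed at: https://developer.nvidia.com/nvbugs/cuda/add
You will need to log in with developer credentials.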

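Also, your code appears to exit by design after 2 hours. I don't see how it could behave as you describe:

7. Run the application; it should keep running after 2, 3, or more hours.

unless there is something wrong with your tick-count timing, which I haven't verified.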
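One way to rule out the timing logic is to express the 2-hour limit with a monotonic clock and check it in isolation. The sketch below is an illustration, not the code from the question: it uses std::chrono::steady_clock instead of GetTickCount64, and the function name timeLimitExceeded is hypothetical.

```cpp
#include <chrono>

// Returns true once strictly more than `limit` has elapsed between
// `start` and `now`, mirroring the `cur - start > limit` test in the sample.
// steady_clock is monotonic, so wall-clock adjustments do not affect it.
bool timeLimitExceeded(std::chrono::steady_clock::time_point start,
                       std::chrono::steady_clock::time_point now,
                       std::chrono::hours limit)
{
    return (now - start) > limit;
}
```

Inside the loop, the check would read `if (timeLimitExceeded(start, std::chrono::steady_clock::now(), std::chrono::hours(2))) break;`, with start captured once before the loop begins.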

Answer 2

The bug was fixed in the Tesla drivers starting with version 333.11. If you run into the same problem, make sure you have updated your driver.

Application crashes in nvcuda.dll after running for a long time
