Question

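I have installed all the required packages on my VM, but I have not installed the NVIDIA GPU driver. The requirements include no instructions for installing the NVIDIA GPU driver, so I would like to know which CUDA version, and which NVIDIA driver compatible with it, I need in order to resolve the following error.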

GitHub link: github

Error log:

  File "run_ner.py", line 594, in <module>
    main()
  File "run_ner.py", line 489, in main
    loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "run_ner.py", line 35, in forward
    valid_output = torch.zeros(batch_size,max_len,feat_dim,dtype=torch.float32,device='cuda')
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
    _check_driver()
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/cuda/__init__.py", line 99, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

After installing the latest CUDA version from the following link (cuda), I get the following error:

06/04/2020 07:38:40 - INFO - __main__ -   ***** Running training *****
06/04/2020 07:38:40 - INFO - __main__ -     Num examples = 14041
06/04/2020 07:38:40 - INFO - __main__ -     Batch size = 32
06/04/2020 07:38:40 - INFO - __main__ -     Num steps = 2190
Epoch:   0%|                                                                                 | 0/5 [00:00<?, ?it/s]
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "run_ner.py", line 594, in <module>
    main()
  File "run_ner.py", line 489, in main
    loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
  File "/home/pt3_gcp/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "run_ner.py", line 35, in forward
    valid_output = torch.zeros(batch_size,max_len,feat_dim,dtype=torch.float32,device='cuda')
  File "/home/pt3_gcp/.local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50
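
Both tracebacks end at the same statement, line 35 of run_ner.py, where the tensor is created with a hard-coded device='cuda'. As a minimal sketch (the helper name pick_device is hypothetical and not part of BERT-NER), the following checks whether PyTorch can see a usable GPU before anything is allocated, and prints the CUDA version the installed PyTorch build targets, which is the version the driver has to support:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a working GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    # torch.version.cuda is the CUDA version this PyTorch build targets
    # (None for CPU-only builds); the installed driver must support it.
    print("PyTorch built for CUDA:", torch.version.cuda)
    # is_available() returns False instead of raising when no driver
    # or no GPU is present.
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

If this prints "cpu" on a machine that is supposed to have a GPU, the driver is missing or the VM has no GPU attached, and the script will fail as shown above until that is fixed.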

Answer 1

I ran into the same problem a while ago. The following commands fixed it for me!

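This problem occurs when you have multiple installations, and since you have tried a lot of things, you probably have several by now. Basically, remove everything: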

sudo apt-get purge 'nvidia-*'  # quoted so the shell does not expand the pattern itself
sudo apt-get remove nvidia-cuda-toolkit
sudo apt autoremove --purge cuda-10-0  # you might have a different version; check it with nvcc --version

Also delete the leftover files under /usr/local:

sudo rm -rf /usr/local/cuda*  # anything related to CUDA
sudo rm -rf /usr/local/nvidia*  # anything related to NVIDIA
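
Because rm -rf cannot be undone, it may be worth listing what those two patterns match first. A minimal sketch, assuming the same /usr/local/cuda* and /usr/local/nvidia* patterns as above (the function name leftover_installs is made up for illustration):

```python
import glob


def leftover_installs(patterns=("/usr/local/cuda*", "/usr/local/nvidia*")):
    """Return every path matched by the given glob patterns."""
    found = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))
    return found


# Print each leftover path; no output means there is nothing to delete.
for path in leftover_installs():
    print(path)
```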

Now, finally, reinstall:

sudo apt-get update  # refresh the package lists

sudo apt search nvidia-driver  # list the available driver versions and note the latest one

sudo apt install nvidia-driver-450  # or whichever number is the latest version

You must reboot after installing!

sudo reboot

When you are back, nvidia-smi and your GPU should work.
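
To confirm this from a script rather than by eye, something like the following could be used. It is only a sketch: the function name driver_status and its return strings are made up here, and it checks nothing beyond whether nvidia-smi is on the PATH and exits successfully.

```python
import shutil
import subprocess


def driver_status():
    """Report whether nvidia-smi is installed and runs without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi not found"
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"nvidia-smi could not be run: {exc}"
    if result.returncode != 0:
        # nvidia-smi exits non-zero when it cannot talk to the driver.
        message = result.stderr.strip() or result.stdout.strip()
        return "nvidia-smi failed: " + message
    return "driver OK"


print(driver_status())
```

Anything other than "driver OK" suggests the driver is still not loaded, or that the VM has no GPU attached at all.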

CUDA runtime error: which compatible CUDA version can be used to run the NER task with BERT-NER
