Question

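The job completes successfully on the CPU, but it does not use the GPU. When I run the code in Jupyter Notebook, the Jupyter console shows this error message: failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error.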

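Hardware and software information:

OS: I tried running the code on both Ubuntu 18.04.3 ppc64le and RHEL 7.6; the job does not run on the GPU on either.
CUDA: 10.1.243
GPU driver: 418.87.00
CUDA toolkit: 10.1
TensorFlow: 14.01a (as bundled with IBM PowerAI CE 1.6.1)
Hardware: AC922 with 4 NVIDIA V100 GPUs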

I tried running a CNN training job, and also simply listing the local devices with the following code. In both cases only the CPU is listed.

from tensorflow.python.client import device_lib as _device_lib
_device_lib.list_local_devices()

The Jupyter Notebook console shows the following error:

[tensorflow/stream_executor/cuda/cuda_driver.cc:318]
Failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
[tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169]
Retrieving CUDA diagnostic information for host: powerai
[tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176]
hostname: powerai
[tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200]
libcuda reported version is : 418.87.0
[tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204]
kernel reported version is : 418.78.0
[tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310]
kernel version seems to match DSO: 418.87.0

After that, I ran one of the CUDA samples to check whether CUDA itself was working, and it failed with the following error:

$ sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

Answer 1

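On PPC RHEL, the product documentation lists several configuration steps that are required before the GPUs can be used:

* setting up the udev rules
* configuring the NVIDIA persistence service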
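A quick way to check the persistence service is to ask systemd for its state. The following is only a minimal sketch under assumptions: the unit name nvidia-persistenced is the one NVIDIA's driver packages normally install, but your distribution may name it differently, and the helper name service_state is made up for illustration.

```python
import shutil
import subprocess


def service_state(unit="nvidia-persistenced"):
    """Return systemd's state string for `unit` (e.g. "active", "inactive"),
    or None if systemctl is missing or gives no answer.

    The default unit name is an assumption; adjust it for your system.
    """
    if shutil.which("systemctl") is None:
        return None
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit just means "not active"
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print("nvidia-persistenced state:", service_state())
```

Anything other than "active" suggests that the persistence service step from the documentation still needs to be done.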

You can also refer to Troubleshooting NVIDIA GPU driver issues.

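The SELinux settings on the NVIDIA devices can also cause GPU access problems. Try turning SELinux off temporarily (setenforce 0) to see whether that resolves the problem. If it does, run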

  restorecon -v -R /usr/
  restorecon -v -R /dev/

and then re-enable SELinux (setenforce 1). Hopefully that resolves the problem.
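Before and after trying the SELinux steps, it can help to record the current SELinux mode and which NVIDIA device nodes exist. The sketch below is one possible way to do that; the function names are hypothetical, and it assumes the standard getenforce tool and the usual /dev/nvidia* device node naming.

```python
import glob
import shutil
import subprocess


def selinux_mode():
    """Return "Enforcing", "Permissive" or "Disabled" as printed by getenforce,
    or None if getenforce is not installed or cannot be run."""
    if shutil.which("getenforce") is None:
        return None
    try:
        result = subprocess.run(
            ["getenforce"], capture_output=True, text=True, check=False, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None


def nvidia_device_nodes():
    """List the NVIDIA device nodes under /dev (assumed naming: /dev/nvidia*)."""
    return sorted(glob.glob("/dev/nvidia*"))


if __name__ == "__main__":
    print("SELinux mode:", selinux_mode())
    print("NVIDIA device nodes:", nvidia_device_nodes())
```

If the mode is "Enforcing" and the GPU only works after setenforce 0, the restorecon commands above are the next thing to try.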

Finally, there is a known race condition on POWER9 systems; see What to do with “cudaSuccess (3 vs. 0) initialization error” on a POWER9 system?

Answer 2

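Some possibilities:

Can you confirm, before running the code, whether a GPU is actually in use (or available)?
Do you have permission to submit jobs that run on the GPU (if your setup is based on a job-scheduling system)?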
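One way to confirm GPU availability independently of TensorFlow is to call the CUDA driver API directly from Python. This is a rough sketch, not a definitive test: it assumes the driver library can be loaded as libcuda.so.1, and the function name cuda_driver_status is made up for illustration. In the driver API, a result code of 0 is CUDA_SUCCESS and 3 is CUDA_ERROR_NOT_INITIALIZED.

```python
import ctypes


def cuda_driver_status(library="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount through ctypes.

    Returns (result_code, device_count), or None if the driver library
    cannot be loaded. The library name is an assumption for Linux systems.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return None

    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)
    if code != 0:
        return code, 0

    # CUresult cuDeviceGetCount(int *count)
    lib.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.cuDeviceGetCount.restype = ctypes.c_int
    count = ctypes.c_int(0)
    code = lib.cuDeviceGetCount(ctypes.byref(count))
    if code != 0:
        return code, 0
    return code, count.value


if __name__ == "__main__":
    status = cuda_driver_status()
    if status is None:
        print("CUDA driver library could not be loaded")
    else:
        print("result code: %d, devices: %d" % status)
```

Running this as the same user and in the same environment as the Jupyter kernel shows whether the failure is specific to TensorFlow or already happens at the driver level.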

Failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error when I execute code in Jupyter Notebook, and the GPU is not used
