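Debugging a broken tensorflow-gpu installation with Conda (1.14 vs 2.3) on Ubuntu 18.04

Time: 2020-11-01 08:27:46

Tags: tensorflow conda

Recently I made the mistake of being unhappy with my TF installation, and I broke everything. I used to have two Conda envs, one with TF 1.14 and one with TF 2.1, both on Cuda 10.1 and both working fine. After a lot of fiddling, I now have a main Conda env with TF 2.3 and Cuda 10.1. However, after all the work of installing the libs and TensorRT, I created a new env for TF 1.14 (I still have some old code that I have not ported yet), and the command that used to work like a charm, conda install -c (conda-forge|anaconda) tensorflow-gpu, now gives me an install that does not see my GPU.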

Sun Nov  1 09:15:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     6W /  N/A |     11MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2719      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.1:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.2:
doc  lib64  LICENSE  README  targets  version.txt

/usr/local/cuda-11.1:
include  lib64  src  targets
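
The listing above suggests that only some of the cuda* directories are complete toolkits (cuda-10.2 and cuda-11.1 have no bin directory). A minimal sketch for checking this programmatically follows. The function name find_cuda_installs is hypothetical, and treating the presence of bin/nvcc as the sign of a full toolkit is an assumption, not an official rule.

```python
import os


def find_cuda_installs(root="/usr/local"):
    """Map each cuda* directory under root to whether it contains bin/nvcc.

    Assumption: a directory without bin/nvcc is only a partial install
    (for example, a few libraries dropped in by another package).
    """
    installs = {}
    if not os.path.isdir(root):
        return installs
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # isdir follows symlinks, so /usr/local/cuda is included as well
        if name.startswith("cuda") and os.path.isdir(path):
            nvcc = os.path.join(path, "bin", "nvcc")
            installs[name] = os.path.isfile(nvcc)
    return installs


if __name__ == "__main__":
    for name, has_nvcc in find_cuda_installs().items():
        status = "has bin/nvcc" if has_nvcc else "partial (no bin/nvcc)"
        print(name, status)
```

This only inspects the system-wide directories. A cudatoolkit package installed by conda lives inside the env and is not covered by this check.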

And finally, the error:

In [2]: tf.test.is_gpu_available()                                                                                                                                                     
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz                                                                     
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:                             
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>                                                    
Out[2]: False    

(In my other env, with TF 2.3, everything is fine:)

In [2]: tf.config.list_physical_devices()                                                                                                                                              
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1                                           
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:                                                                   
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5                                                                                              
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s                                                                                         
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1                                      
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10                                        
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10                                         
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10                                        
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10                                      
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10                                      
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7                                          
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0                                                                     
Out[2]:                                                                                                                                                                                
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),                                                                                                                     
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]   
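
The two sessions above use different calls (tf.test.is_gpu_available() in the 1.14 env, tf.config.list_physical_devices() in the 2.3 env). A small sketch of a check that can be pasted into either env is shown here. The helper name list_gpus is hypothetical, and the assumption that TF 1.14 exposes the listing call under tf.config.experimental is from memory, so the code falls back to tf.test.is_gpu_available() if neither listing call is present.

```python
def list_gpus(tf):
    """Return the names of visible GPUs using whichever API this TF offers."""
    config = getattr(tf, "config", None)
    # Newer releases: tf.config.list_physical_devices
    # Older releases (assumed to include 1.14): tf.config.experimental.*
    for holder in (config, getattr(config, "experimental", None)):
        lister = getattr(holder, "list_physical_devices", None)
        if lister is not None:
            return [device.name for device in lister("GPU")]
    # Last resort: the boolean check used in the question
    return ["GPU (name unknown)"] if tf.test.is_gpu_available() else []


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        print(tf.__version__, list_gpus(tf))
```

An empty list in one env and a non-empty list in the other, on the same machine, points at the env's own CUDA libraries rather than at the driver.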

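I also know that the Conda distribution of TF worked with Cuda 10.1 on my machine until yesterday. Now I redo what seem to me to be the same steps and nothing works, so maybe that is where the problem lies...?

Has anyone run into this? I also need to solve this on another machine, which has exactly the same problem and has no cuda-11.1 in /usr/local... Thanks in advance!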

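1 Answer:

Answer 0 (score: 0)

So, after a lot of wrangling (wanting to set up not one but two versions of TF on a single machine, in this day and age, is surely a symptom of madness), the solution I found to work is: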

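  • In the main TF 2.3 env, follow the steps described here, with the following two tweaks:
    • Do not install tensorflow.
    • Currently (October 2020), sudo apt-get install --no-install-recommends cuda-10-1 no longer works, but conda install cudatoolkit=10.1.243 does, see this;
    • As a side note, I also noticed that TF 2.3 could not find a whole array of libraries (libcublas.so.10, libcufft.so.10, libcurand.so.10, etc.) until I installed cuda 10.2 with conda install cudatoolkit=10.2.89. I have seen people discuss this here, so it is not clear that this is a perfect solution (others symlink the files, or manually copy them from one directory to another; memorable times);
    • (Another option, without TensorRT but very useful for purging the cuda and nvidia stuff, and with a failsafe, is here.)
  • After installing all the libraries, cuda, etc. (at this point you need to reboot, and you can use nvidia-smi to check that the GPU is visible), create a brand-new env and install TF 1.14 using the anaconda channel (conda-forge failed for me): conda install tensorflow-gpu=1.14
  • Finally, go back to the main env and install tensorflow with pip.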
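
To see which of the CUDA libraries mentioned above can actually be loaded from inside a given env, something like the following sketch may help. The library names are taken from the TF 2.3 log in the question, and the function name probe_libraries is hypothetical. It only reports what the dynamic loader resolves for this Python process, which may differ from how TF itself searches.

```python
import ctypes

# Library names copied from the dso_loader lines in the TF 2.3 log
LIBRARIES = [
    "libcuda.so.1",
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def probe_libraries(names):
    """Try to load each shared library and record whether it succeeded."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the loader cannot find or open the library
            results[name] = False
    return results


if __name__ == "__main__":
    for name, found in probe_libraries(LIBRARIES).items():
        print("found  " if found else "MISSING", name)
```

Run it once in each env after activating it. A library that is missing in only one env is a candidate for the cudatoolkit version mismatch described above.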

At that point, in the TF 1.14 env, you should have this:

$ conda list | grep tensor
tensorboard               1.14.0           py37hf484d3e_0    anaconda
tensorflow                1.14.0          gpu_py37h74c33d7_0    anaconda
tensorflow-base           1.14.0          gpu_py37he45bfe2_0    anaconda
tensorflow-estimator      1.14.0                     py_0    anaconda
tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda

And, importantly:

$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0

This will not work if you installed TF via pip beforehand.
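
One way to check whether an env already contains pip-installed TF packages is to look at the channel column of conda list, which reads pypi for packages that came from pip (as in the TF 2.3 listing in this answer). The sketch below parses that output. The function names are hypothetical, and the four-column layout is assumed from the listings shown here.

```python
def parse_conda_list(text):
    """Parse `conda list` output into {name: {version, build, channel}}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, conda's "#" header lines, and pasted shell prompts
        if not line or line.startswith("#") or line.startswith("$"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        packages[fields[0]] = {
            "version": fields[1],
            "build": fields[2] if len(fields) > 2 else "",
            "channel": fields[3] if len(fields) > 3 else "",
        }
    return packages


def pip_installed(packages):
    """Return the names of packages whose channel column is 'pypi'."""
    return sorted(n for n, info in packages.items() if info["channel"] == "pypi")


if __name__ == "__main__":
    sample = """
    tensorflow                1.14.0          gpu_py37h74c33d7_0    anaconda
    tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda
    """
    print(pip_installed(parse_conda_list(sample)))
```

For the TF 1.14 env the result should be an empty list. Any tensorflow entry showing up there would mean pip got to the env first.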

Then activate your other, main env and finish the installation with pip:

$ pip install tensorflow

Which should give you:

$ conda list | grep tensor
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow                2.3.1                    pypi_0    pypi
tensorflow-estimator      2.3.0                    pypi_0    pypi

And:

$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0