Question

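Recently I made the mistake of being dissatisfied with my TF installation and broke everything. I used to have two Conda envs, with TF 1.14 and TF 2.1, both on Cuda 10.1 and both working fine. After a lot of rummaging around I now have a main Conda env with TF 2.3 and Cuda 10.1. However, after doing all the work of installing the libs and TensorRT, I created a new env for TF 1.14 (I still have some old code I haven't ported), and what used to work like a charm, conda install -c (conda-forge|anaconda) tensorflow-gpu, now fails to see my GPU.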

Sun Nov  1 09:15:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     6W /  N/A |     11MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2719      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

/usr/local/cuda:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.1:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.2:
doc  lib64  LICENSE  README  targets  version.txt

/usr/local/cuda-11.1:
include  lib64  src  targets

And finally, the error:

In [2]: tf.test.is_gpu_available()                                                                                                                                                     
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz                                                                     
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:                             
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>                                                    
Out[2]: False
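
The log above contains no attempt to open libcuda.so.1 at all, so one thing worth ruling out is that the env ended up with a CPU-only build. A minimal sketch of such a check follows; the helper name `describe_tf` is made up for illustration, and it relies only on `tf.test.is_built_with_cuda()`, `tf.test.is_gpu_available()` and `tf.config.list_physical_devices()`, falling back to the 1.x call when the 2.x one is absent.

```python
def describe_tf(tf):
    """Summarise GPU support for an imported TensorFlow module (1.x or 2.x)."""
    info = {
        "version": tf.__version__,
        # False here means the installed binary is a CPU-only build.
        "built_with_cuda": bool(tf.test.is_built_with_cuda()),
    }
    config = getattr(tf, "config", None)
    if hasattr(config, "list_physical_devices"):
        # Newer API, the one used in the TF 2.3 session of this question.
        info["gpu_visible"] = len(config.list_physical_devices("GPU")) > 0
    else:
        # TF 1.x API, the one used in the failing session above.
        info["gpu_visible"] = bool(tf.test.is_gpu_available())
    return info

# Usage inside the env under test:
#   import tensorflow as tf
#   print(describe_tf(tf))
```

If `built_with_cuda` comes back False, the problem is probably the package variant that conda resolved rather than the driver or the CUDA libraries.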

(In my other env, with TF 2.3, everything is fine:)

In [2]: tf.config.list_physical_devices()                                                                                                                                              
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1                                           
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:                                                                   
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5                                                                                              
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s                                                                                         
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1                                      
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10                                        
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10                                         
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10                                        
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10                                      
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10                                      
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7                                          
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0                                                                     
Out[2]:                                                                                                                                                                                
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),                                                                                                                     
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
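
The working log names every shared library TF 2.3 opens. A rough sketch for checking whether the dynamic loader can resolve those same names from inside a given env is shown below. It uses `ctypes.CDLL`, so it follows the normal loader search path (for example `LD_LIBRARY_PATH`), which is not necessarily how a conda-packaged TF locates its libraries; a "MISSING" result is therefore a hint, not proof. The sonames a TF 1.14 build expects may also differ from the list here, which is copied from the TF 2.3 log.

```python
import ctypes

# Library names copied from the working TF 2.3 log above.
CUDA_LIBS = [
    "libcuda.so.1",
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def probe_libraries(names):
    """Return {name: bool}, True when the dynamic loader can open the library."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library (or one of its dependencies) is not found.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, found in probe_libraries(CUDA_LIBS).items():
        print(("ok     " if found else "MISSING"), name)
```

Running it once in the working env and once in the broken one gives a quick side-by-side view of which libraries each env can reach.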

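I also know that the Conda distribution of TF worked with Cuda 10.1 on my machine right up until yesterday. Now I repeat what look to me like the same steps and nothing works, so maybe that is where the problem lies...?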

Has anyone run into this? I also need to fix it on another machine with exactly the same problem, and that one has no cuda-11.1 in /usr/local... Thanks in advance!

Answer 1

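So, after a lot of wrangling (and wanting to set up not one but two versions of TF on a single machine in this day and age is surely a symptom of madness), this is the solution I found to work:

In the main TF 2.3 environment, follow the steps described here, with the following adjustments:
- Do not install tensorflow yet.
- Currently (October 2020), sudo apt-get install --no-install-recommends cuda-10-1 no longer works, but conda install cudatoolkit=10.1.243 does; see this.
- As a side note, I also noticed that TF 2.3 could not find a whole array of libraries (libcublas.so.10, libcufft.so.10, libcurand.so.10, etc.) until I installed cuda 10.2 with conda install cudatoolkit=10.2.89. I have seen people discuss this here, so it is not clear that this is a perfect solution (others symlink the files, or copy them manually from one directory to another, as in the bad old days).
- (Another option, without TensorRT, but very useful for purging the cuda and nvidia packages, and with failsafes, is here.)
After installing all the libraries, cuda, etc. (at this point you need to reboot, and you can check with nvidia-smi that the GPU is visible), create a brand-new environment and install TF 1.14 using the anaconda channel (conda-forge failed for me): conda install tensorflow-gpu=1.14.
Finally, go back to the main environment and install tensorflow with pip.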
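
The ordering is the part that is easy to get wrong, so here is a small sketch that only prints the commands in the order described above and runs nothing. The env names `main` and `tf114` are placeholders, Python 3.7 is an assumption based on the py37 build strings listed further down, and `-n` / `conda run` are used in place of activating each env by hand.

```python
def install_plan(main_env="main", legacy_env="tf114"):
    """Return the shell commands in the order the answer describes (nothing is executed)."""
    return [
        # 1. CUDA runtime for the main env, from conda rather than apt.
        f"conda install -n {main_env} cudatoolkit=10.1.243",
        # 2. Fresh env for the legacy code (Python version is an assumption).
        f"conda create -n {legacy_env} python=3.7",
        # 3. TF 1.14 from the anaconda channel, before any pip install of TF.
        f"conda install -n {legacy_env} -c anaconda tensorflow-gpu=1.14",
        # 4. Only now install TF 2.x into the main env with pip.
        f"conda run -n {main_env} pip install tensorflow",
    ]


if __name__ == "__main__":
    for step, command in enumerate(install_plan(), start=1):
        print(f"{step}. {command}")
```

Reboot and confirm the GPU with nvidia-smi between steps 1 and 2, as noted above.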

In that environment you should then see this:

$ conda list | grep tensor
tensorboard               1.14.0           py37hf484d3e_0    anaconda
tensorflow                1.14.0          gpu_py37h74c33d7_0    anaconda
tensorflow-base           1.14.0          gpu_py37he45bfe2_0    anaconda
tensorflow-estimator      1.14.0                     py_0    anaconda
tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda

And, importantly:

$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0

This will not work if you have already installed TF via pip beforehand.

Then activate your other (main) environment and finish the installation with pip:

$ pip install tensorflow

which should give you:

$ conda list | grep tensor
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow                2.3.1                    pypi_0    pypi
tensorflow-estimator      2.3.0                    pypi_0    pypi

and:

$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
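
To confirm programmatically which installer each env's tensorflow came from, the `conda list` output shown above can be parsed. This is only a sketch: the function names are invented for illustration, and it assumes the four-column layout (name, version, build, channel) seen in the listings, where pip-installed packages carry the channel `pypi`.

```python
def parse_conda_list(text):
    """Map package name -> (version, build, channel) from `conda list` output."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, '#' header lines and an echoed '$ ...' command line.
        if len(parts) < 3 or parts[0].startswith(("#", "$")):
            continue
        channel = parts[3] if len(parts) > 3 else ""
        packages[parts[0]] = (parts[1], parts[2], channel)
    return packages


def tensorflow_source(text):
    """Return 'pip', 'conda' or None depending on how tensorflow was installed."""
    entry = parse_conda_list(text).get("tensorflow")
    if entry is None:
        return None
    return "pip" if entry[2] == "pypi" else "conda"
```

For the two listings in this answer, the TF 1.14 env would report 'conda' and the main TF 2.3 env 'pip', which is the combination that worked here.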

Debugging a broken tensorflow-gpu installation with Conda (1.14 vs 2.3), Ubuntu 18.04
