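I recently made the mistake of being unhappy with my TF installation, and I broke everything. I used to have two Conda envs, one with TF 1.14 and one with TF 2.1, both on Cuda 10.1 and both working fine. After a lot of fiddling, I now have a main Conda env with TF 2.3 and Cuda 10.1. But after all the work of installing libs and tensorrt, I created a new env for TF 1.14 (I still have some old code that I have not ported yet) with the command that used to work like a charm:
conda install -c (conda-forge|anaconda) tensorflow-gpu
and TF in that env no longer sees my GPU.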
Sun Nov 1 09:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... On | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 6W / N/A | 11MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1469 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2719 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
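One thing worth reading off the two outputs above: the "CUDA Version" printed by nvidia-smi is the highest CUDA version the driver supports, while nvcc reports the toolkit that is actually installed. The sketch below parses both and compares them. The helper names are hypothetical and the regular expressions assume the output format shown above.

```python
import re


def parse_driver_cuda(smi_text):
    """Return (driver_version, max_cuda_version) from nvidia-smi header text."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", smi_text)
    return (m.group(1), m.group(2)) if m else None


def parse_nvcc_release(nvcc_text):
    """Return (release, full_version) from `nvcc --version` text."""
    m = re.search(r"release\s+([\d.]+),\s*V([\d.]+)", nvcc_text)
    return (m.group(1), m.group(2)) if m else None


def version_tuple(version):
    """Turn '10.1' into (10, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports_toolkit(smi_text, nvcc_text):
    """True if the driver's maximum CUDA version is >= the nvcc release."""
    driver = parse_driver_cuda(smi_text)
    toolkit = parse_nvcc_release(nvcc_text)
    if driver is None or toolkit is None:
        return None  # could not parse one of the inputs
    return version_tuple(driver[1]) >= version_tuple(toolkit[0])
```

With the text above, the driver side parses to 11.0 and the toolkit side to 10.1, so the driver itself does not look like the limiting factor here.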
/usr/local/cuda:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.1:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.2:
doc lib64 LICENSE README targets version.txt
/usr/local/cuda-11.1:
include lib64 src targets
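In the listing above, only /usr/local/cuda and /usr/local/cuda-10.1 contain a bin directory, which suggests (this is an inference, not something the listing proves) that cuda-10.2 and cuda-11.1 are partial installs. A small sketch to make the same check repeatable on another machine; find_cuda_toolkits is a hypothetical helper and /usr/local is just the default root used here.

```python
import glob
import os


def find_cuda_toolkits(root="/usr/local"):
    """Map each cuda* directory under root to whether it contains bin/nvcc."""
    result = {}
    for path in sorted(glob.glob(os.path.join(root, "cuda*"))):
        if os.path.isdir(path):  # follows symlinks such as /usr/local/cuda
            has_nvcc = os.path.isfile(os.path.join(path, "bin", "nvcc"))
            result[os.path.basename(path)] = has_nvcc
    return result
```

Calling find_cuda_toolkits() with no argument inspects /usr/local; pass another root to check a different prefix.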
And finally, the error:
In [2]: tf.test.is_gpu_available()
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
Out[2]: False
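Since the two environments use different TF versions, the same check has to go through different calls: tf.test.is_gpu_available() as used above, and tf.config.list_physical_devices() in newer releases. A minimal sketch of a version-agnostic check, assuming the TensorFlow module is passed in by the caller; gpu_visible is a hypothetical helper name.

```python
def gpu_visible(tf):
    """Return True if the given TensorFlow module reports at least one GPU."""
    # Newer TF releases expose tf.config.list_physical_devices.
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    if list_devices is not None:
        return len(list_devices("GPU")) > 0
    # Older releases: fall back to the call used in the session above.
    return bool(tf.test.is_gpu_available())
```

Usage in either environment would be: import tensorflow as tf, then gpu_visible(tf).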
(In my other env, with TF 2.3, everything is fine:)
In [2]: tf.config.list_physical_devices()
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Out[2]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
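The working log above names every shared library TF 2.3 opened. The failing TF 1.14 log shows no dso_loader lines at all, so one hedged way to narrow things down is to ask the dynamic loader directly whether those libraries can be found. The list below is copied from the TF 2.3 log only; a TF 1.14 build may ask for different version suffixes. loadable is a hypothetical helper.

```python
import ctypes

# Library names copied from the dso_loader lines in the TF 2.3 log above.
CUDA_LIBS = [
    "libcuda.so.1",
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def loadable(names):
    """Map each library name to True if the dynamic loader can open it."""
    status = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            status[name] = True
        except OSError:  # raised when the library cannot be found or loaded
            status[name] = False
    return status
```

Note that the result reflects the loader's search path for the current process (for example LD_LIBRARY_PATH), so libraries shipped inside a Conda env may be reported as missing unless that env's lib directory is on the path.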
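I also know that the Conda distribution of TF worked with Cuda 10.1 on my machine right up until yesterday. Now I have redone what seem to me to be the same steps and nothing works, so maybe that is where the problem lies...?
Has anyone run into this? I also need to solve it on another machine, which has exactly the same problem and has no cuda-11.1 in /usr/local.
Thanks in advance!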
Answer 0 (score: 0)
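So, after a lot of wrangling (and wanting, in this day and age, to set up not one but two versions of TF on a single machine is surely a symptom of madness), the solution I found to work is the following.
sudo apt-get install --no-install-recommends cuda-10-1
no longer works, but
conda install cudatoolkit=10.1.243
does, see this. There is also
conda install cudatoolkit=10.2.89
which I have seen people talk about here, so it is not clear that this is a perfect solution (other people symlink the files, or copy them by hand from one directory to another; those days will be remembered). Run
nvidia-smi
to check that the GPU is visible, then create a brand-new environment and install TF 1.14 from the anaconda channel (conda-forge failed for me):
conda install tensorflow-gpu=1.14
At that point you should have this: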
$ conda list | grep tensor
tensorboard 1.14.0 py37hf484d3e_0 anaconda
tensorflow 1.14.0 gpu_py37h74c33d7_0 anaconda
tensorflow-base 1.14.0 gpu_py37he45bfe2_0 anaconda
tensorflow-estimator 1.14.0 py_0 anaconda
tensorflow-gpu 1.14.0 h0d30ee6_0 anaconda
And, importantly:
$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
If you had previously installed TF with pip in this environment, this will not work.
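Whether pip has already put TF packages into an environment can be read off the channel column of conda list, which shows pypi for pip-installed packages (as in the TF 2.3 listing in this answer). A minimal sketch that scans pasted conda list output; pip_installed_packages is a hypothetical helper and the four-column layout is assumed from the listings here.

```python
def pip_installed_packages(conda_list_text, prefix="tensor"):
    """Return names of matching packages whose channel column is 'pypi'."""
    found = []
    for line in conda_list_text.splitlines():
        parts = line.split()
        # Expected columns: name, version, build, channel.
        if line.startswith("#") or len(parts) < 4:
            continue
        name, channel = parts[0], parts[3]
        if name.startswith(prefix) and channel == "pypi":
            found.append(name)
    return found
```

For the TF 1.14 environment the result should be an empty list before going any further.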
Then activate your other, base environment and do the installation there with pip:
$ pip install tensorflow
which should give you:
$ conda list | grep tensor
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.1 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
And:
$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0