Question

I am using a Google Cloud VM with 4 Tesla K80 GPUs.

I am running a Keras model using multi_gpu_model with gpus=4 (since I have 4 GPUs). However, I get the following error:

ValueError: To call `multi_gpu_model` with `gpus=4`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing `gpus`.

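I can see that only two GPU devices are listed here, namely '/xla_gpu:0' and '/gpu:0'. So I tried gpus=2 and got the following error again:

ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing `gpus`.

Can anyone help me resolve this error? Thanks!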

Answer 1

It looks like Keras only sees one of the GPUs.

To make sure all 4 GPUs are accessible, you can use device_lib from TensorFlow.

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

You may need to manually install or update the GPU drivers on the instance. Consult the instructions here.

Answer 2

TensorFlow sees only one GPU (the gpu and xla_gpu devices are two backends for the same physical device). Are you setting CUDA_VISIBLE_DEVICES? Does nvidia-smi show all the GPUs?
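Both checks can be scripted. The following is a minimal sketch and not part of the original answer: the helper name `gpu_visibility_report` is made up for illustration, and it assumes `nvidia-smi` is on the PATH (the corresponding field is None when it is not).

```python
import os
import subprocess


def gpu_visibility_report(env=None):
    """Collect the two things asked about above (hypothetical helper)."""
    env = os.environ if env is None else env
    # None means the variable is unset, so CUDA applies no filtering.
    # An empty string hides every GPU from CUDA applications.
    visible = env.get("CUDA_VISIBLE_DEVICES")
    try:
        # `nvidia-smi -L` prints one line per GPU the driver can see.
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        driver_gpus = [line for line in result.stdout.splitlines() if line.strip()]
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed, e.g. no driver installed.
        driver_gpus = None
    return {"CUDA_VISIBLE_DEVICES": visible, "nvidia_smi_gpus": driver_gpus}


print(gpu_visibility_report())
```

If nvidia-smi lists four GPUs but CUDA_VISIBLE_DEVICES is set to a single index such as 0, TensorFlow will only be shown that one device.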

Answer 3

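I ran into the same problem and I think I have found a workaround. In my case I am working on an HPC cluster: I installed Keras under /.local, while the IT staff installed TensorFlow and CUDA. Either way, I got the same error. I am using Tensorflow==1.15.0 and Keras==2.3.1.

I noticed that the error message:

ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/xla_gpu:1']. Try reducing `gpus`.

is raised at line 184 of the following Keras file: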

/home/.local/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py

I worked around it by replacing line 175 as follows:

target_devices = ['/cpu:0'] + ['/gpu:%d' % i for i in target_gpu_ids] (before)
target_devices = ['/cpu:0'] + ['/xla_gpu:%d' % i for i in target_gpu_ids] (after)

In addition, I modified the following Keras file:

/home/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py

There I replaced line 510 as follows:

return [x for x in _LOCAL_DEVICES if 'device:gpu' in x.lower()] (before)
return [x for x in _LOCAL_DEVICES if 'device:XLA_GPU' in x] (after)

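Long story short, this appears to be a Keras bug rather than an environment setup issue. After these modifications, my network was able to run on the xla_gpu devices. I hope this helps.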
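Why the patch helps can be illustrated without Keras. The sketch below mimics the kind of check that produces the error: it builds the expected device names in the same way as the `target_devices` line above and compares them with the devices listed in the error message. `missing_devices` and its `prefix` parameter are hypothetical names, not Keras functions, the sketch assumes the GPU ids are simply 0 to gpus-1, and the real check in multi_gpu_utils.py may differ in detail.

```python
def missing_devices(gpus, available, prefix="/gpu"):
    """Return the expected device names that are absent from `available`.

    Illustrative only: `missing_devices` and `prefix` are not Keras names.
    """
    # Assumption: GPU ids are 0..gpus-1, as when `gpus` is an integer.
    target_devices = ["/cpu:0"] + ["%s:%d" % (prefix, i) for i in range(gpus)]
    return [d for d in target_devices if d not in available]


# Device list taken from the error message in this answer.
available = ["/cpu:0", "/xla_cpu:0", "/xla_gpu:0", "/xla_gpu:1"]

# Expecting /gpu:N names: both GPUs are reported missing.
print(missing_devices(2, available))  # ['/gpu:0', '/gpu:1']

# Expecting /xla_gpu:N names: nothing is missing.
print(missing_devices(2, available, prefix="/xla_gpu"))  # []
```

Note that editing files under site-packages is overwritten by any reinstall or upgrade of Keras, so the change has to be reapplied afterwards.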

Answer 4

You can check the full list of devices with the following code:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()
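The raw list is easier to read when grouped by device type, which makes it obvious whether the GPUs show up as GPU or only as XLA_GPU. A minimal sketch: `group_by_type` is a hypothetical helper that relies only on the `name` and `device_type` attributes already used in Answer 1. The TensorFlow call is left as a comment so the grouping can be tried on stand-in objects first.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(devices):
    """Map each device_type to the list of device names of that type."""
    grouped = defaultdict(list)
    for dev in devices:
        grouped[dev.device_type].append(dev.name)
    return dict(grouped)


# With TensorFlow installed, pass the real device list instead:
#   from tensorflow.python.client import device_lib
#   print(group_by_type(device_lib.list_local_devices()))

# Stand-in objects shaped like the devices reported in the question.
fake_devices = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:XLA_CPU:0", device_type="XLA_CPU"),
    SimpleNamespace(name="/device:XLA_GPU:0", device_type="XLA_GPU"),
    SimpleNamespace(name="/device:GPU:0", device_type="GPU"),
]
print(group_by_type(fake_devices))
```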

Answer 5

This may be caused by using tensorflow instead of tensorflow-gpu.

One way to fix this is as follows:

$ pip uninstall tensorflow
$ pip install tensorflow-gpu

More information can be found here: https://stackoverflow.com/a/42652258/6543020
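Before uninstalling anything, it can help to see which of the two packages is actually installed in the active environment. A minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); `installed_tf_packages` is a hypothetical helper name.

```python
from importlib import metadata


def installed_tf_packages(names=("tensorflow", "tensorflow-gpu")):
    """Return {package name: installed version, or None if absent}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


print(installed_tf_packages())
```

Run it with the same interpreter that runs the Keras model, since a conda or virtualenv environment can hold a different package set than the system Python.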

Answer 6

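I have the same problem. I have Tensorflow-gpu 1.14 and CUDA 10.0 installed, and device_lib.list_local_devices() shows 4 XLA_GPU devices.

I have another conda environment with only Tensorflow 1.14 installed and no tensorflow-gpu. I don't know why, but in that environment I can run the multi_gpu model on all the GPUs.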

ValueError when using multi_gpu_model in Keras
