When I start the program with docker-compose up, the code runs fine. However, when I start it with docker stack deploy -c docker-compose.yml test, it cannot find any visible NVIDIA devices. My docker-compose.yml and the error log are shown below.
I am very confused: with the same configuration, docker-compose up works well, but docker stack deploy -c docker-compose.yml test does not. Is GPU support in Docker swarm still incomplete, or is there some other approach that I have missed?
docker version: 18.06.0-ce
NVIDIA Docker: 1.0.1
Ubuntu: 16.04
Of course, I modified /etc/docker/daemon.json to change the default runtime, and then restarted the Docker daemon.
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl daemon-reload
sudo systemctl start docker
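A quick way to sanity-check that the default-runtime change took effect (the CUDA image tag below is only an example and may need adjusting to the installed CUDA version):
docker info | grep -i runtime                       # should list nvidia and report it as the default runtime
docker run --rm nvidia/cuda:10.0-base nvidia-smi    # should print the GPU table without any --runtime flag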
version: "3"
volumes:
  nvidia_driver_430.14:
    external: true
services:
  tts-server:
    build:
      context: ./
      dockerfile: ./docker/tts_server/Dockerfile
    deploy:
      replicas: 1
    image: tts-system/tts-server-gpu
    environment:
      NVIDIA_VISIBLE_DEVICES: 0
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    volumes:
      - ./models:/tts_system/models:ro
      - ./config:/tts_system/config:ro
      - nvidia_driver_430.14:/usr/local/nvidia:ro
    networks:
      - overlay
    ports:
      - "9091:9090"
2019-07-02 07:50:24.805114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499885000 Hz
2019-07-02 07:50:24.808418: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4112870 executing computations on platform Host. Devices:
2019-07-02 07:50:24.808457: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-07-02 07:50:24.811640: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 07:50:24.811684: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:155] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
E0702 07:50:24.811846 1 decoder.cc:80] Filed to create session: Invalid argument: 'visible_device_list' listed an invalid GPU id '0' but visible device count is -1
This problem has been bothering me for a long time; any help is greatly appreciated.
Answer 0 (score: 1)
According to the issue
https://github.com/docker/compose/issues/6691
compose file format version 3 does not yet officially support --gpus or runtime references to NVIDIA devices.
However, you can install nvidia-docker version 2 and put the following configuration in /etc/docker/daemon.json to make the NVIDIA devices visible to swarm services.
/etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
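After changing daemon.json, the Docker daemon has to be restarted on every swarm node that should run GPU tasks (a minimal sketch, assuming systemd):
sudo systemctl daemon-reload
sudo systemctl restart docker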
docker-compose.yaml
Add the environment key to the service in your compose file in the following format.
...
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
...
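For context, a fuller sketch of what the service from the question might look like with this approach (the image name, port mapping, and device index are taken from the question and are only placeholders; the explicit devices and driver-volume entries are omitted on the assumption that the nvidia runtime injects them):
version: "3"
services:
  tts-server:
    image: tts-system/tts-server-gpu      # example image from the question
    deploy:
      replicas: 1
    environment:
      - NVIDIA_VISIBLE_DEVICES=0          # or "all" to expose every GPU on the node
    ports:
      - "9091:9090"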
The allowed values for NVIDIA_VISIBLE_DEVICES include, for example,
NVIDIA_VISIBLE_DEVICES=all
Reference: https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices
The above configuration works fine for me.