Question

I am studying distributed TensorFlow: https://www.tensorflow.org/deploy/distributed

# Create and start a server for the local task.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()

When I start just a single ps server, I see that it uses all GPUs and all of the GPU memory.

(My environment: 2 Tesla K80 GPUs)

+--------------------------------------------------+
| Processes:                            GPU Memory |
|  GPU       PID  Type  Process name    Usage      |
|==================================================|
|    0     22854    C   python            10891MiB |
|    1     22854    C   python            10890MiB |
+--------------------------------------------------+

Following https://www.tensorflow.org/tutorials/using_gpu, I reduced the memory usage:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         config=config)

But I want the PS server to use only one GPU. How can I do that?

Answer 1

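If you launch from the command line, running CUDA_VISIBLE_DEVICES=0 python train.py restricts the process to GPU id 0, which may also help. Note that there must be no spaces around the = sign, or the shell will not treat it as an environment variable assignment. You can also specify multiple IDs for the workers, e.g. CUDA_VISIBLE_DEVICES=1,2,3.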
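
If the ps and worker processes are started from a Python launcher rather than typed into a shell, the same restriction can be applied through each child process's environment. The following is only a minimal sketch: the helper names env_with_visible_gpus and launch are hypothetical, and train.py with its --job_name/--task_index flags is assumed from the question.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_ids is a list of physical GPU ids, e.g. [0] or [1, 2, 3].
    Inside the child process the visible GPUs are renumbered from 0.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch(script_args, gpu_ids):
    """Start a Python script that can only see the given GPUs (hypothetical helper)."""
    return subprocess.Popen([sys.executable] + list(script_args),
                            env=env_with_visible_gpus(gpu_ids))


# Example (assumed script name and flags, not run here):
# launch(["train.py", "--job_name=ps", "--task_index=0"], gpu_ids=[0])
```

The variable has to be in place before TensorFlow initializes CUDA, which is why the sketch sets it in the child's environment instead of changing it after the process has started.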

Answer 2

config.gpu_options.visible_device_list is the way to specify which GPUs are visible to TensorFlow.
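
As a rough sketch of how this could be combined with the allow_growth setting from the question: visible_device_list is a string of comma-separated GPU ids, so a small helper can fill it in on an existing config. The helper name restrict_to_gpus is hypothetical and not part of TensorFlow.

```python
def restrict_to_gpus(config, gpu_ids):
    """Limit a tf.ConfigProto-style config to the given GPU ids.

    config is expected to expose gpu_options.allow_growth and
    gpu_options.visible_device_list, as tf.ConfigProto does in TF 1.x.
    """
    # Allocate GPU memory on demand instead of grabbing it all up front.
    config.gpu_options.allow_growth = True
    # The field is a comma-separated string, e.g. "0" or "0,1".
    config.gpu_options.visible_device_list = ",".join(str(i) for i in gpu_ids)
    return config
```

With the code from the question, this would presumably be used as config = restrict_to_gpus(tf.ConfigProto(), [0]), followed by passing config=config to tf.train.Server as before, so that the ps process only sees GPU 0.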

Does the PS server in distributed TensorFlow automatically use all GPUs?
