How to run distributed training when each node has multiple workers

Posted: 2017-04-07 18:58:50

Tags: tensorflow, distributed

What is the command for running distributed training on multiple nodes, where each node has multiple GPUs? The example at https://github.com/tensorflow/models/tree/master/inception only shows the case of 1 GPU / 1 worker per node. In my cluster, each node has 4 GPUs and needs 4 workers.

I tried the following commands. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=3 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=4 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=7 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

Note that each command ends with & so that they run in parallel, but this ran into GPU out-of-memory errors.
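One workaround I am considering (this is just a sketch of my own idea, not something from the inception example) is to pin each worker process on a node to a single GPU with CUDA_VISIBLE_DEVICES and give each process its own port; the ports 2223-2225 and the loop below are my own assumptions. For node 0 this would look like:

# Hypothetical launch for node 0: each worker sees only one GPU and
# listens on its own port, so the four processes do not fight over the
# same device or the same listening address.
HOSTS='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i \
  bazel-bin/inception/imagenet_distributed_train \
    --batch_size=32 \
    --data_dir=$HOME/imagenet-data \
    --job_name='worker' \
    --task_id=$i \
    --ps_hosts='ps0.example.com:2222' \
    --worker_hosts=$HOSTS &
done

Would something like this be the right direction, or does the script expect a different layout?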

I also tried using only 1 worker per node, with each worker using all 4 GPUs. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=1 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

But in the end each node only used 1 GPU.
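I am not sure imagenet_distributed_train actually defines a --gpus flag, so it may simply be ignored. A quick way to check which GPUs are actually busy on each node (assuming nvidia-smi is installed) is:

# Poll utilization and memory of every GPU every 5 seconds; if only one
# device ever shows activity, the other GPUs are sitting idle.
watch -n 5 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv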

So what is the exact command I should use? Thanks.

0 Answers
