What is the command to run distributed training on multiple nodes, where each node has multiple GPUs? The example in https://github.com/tensorflow/models/tree/master/inception only shows the case of 1 GPU / 1 worker per node. In my cluster, each node has 4 GPUs, so I need 4 workers per node.
I tried the following commands. On node 0:
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
...... (the same command, with --task_id=1 and --task_id=2)
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=3 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
On node 1:
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=4 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
...... (the same command, with --task_id=5 and --task_id=6)
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=7 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
Note that there is an & at the end of each command so that they run in parallel, but this ended with out-of-GPU-memory errors.
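For reference, here is the cluster I believe those flags describe, written out as a tf.train.ClusterSpec (my own sketch for illustration, not code from the inception example; note that, exactly as in my commands, all four worker tasks on each host share port 2222):

import tensorflow as tf

# Cluster implied by the --ps_hosts / --worker_hosts flags above.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222',
               'worker0.example.com:2222',
               'worker0.example.com:2222',
               'worker0.example.com:2222',
               'worker1.example.com:2222',
               'worker1.example.com:2222',
               'worker1.example.com:2222',
               'worker1.example.com:2222'],
})

# Each bazel-bin command above corresponds to one worker task, e.g.:
# server = tf.train.Server(cluster, job_name='worker', task_index=0)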
I also tried using only 1 worker per node, with each worker using 4 GPUs. On node 0:
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
On node 1:
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=1 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
But in the end only 1 GPU is used on each node.
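For clarity, this is what I expected --gpus=4 to do inside a single worker: the in-graph "tower" pattern from TensorFlow's CIFAR-10 multi-GPU tutorial, roughly like the toy sketch below (my own illustration with a dummy one-weight model, not code from the inception example):

import tensorflow as tf

num_gpus = 4
opt = tf.train.GradientDescentOptimizer(0.01)
w = tf.get_variable('w', [10, 1])   # single parameter shared by all towers

tower_grads = []
for i in range(num_gpus):
    # one "tower" per GPU: the same model applied to a (here: dummy) data shard
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        x = tf.random_normal([8, 10])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(opt.compute_gradients(loss, [w]))

# average the gradients across the towers and apply them once per step
avg_grads = [(tf.add_n([g for g, _ in gvs]) / num_gpus, gvs[0][1])
             for gvs in zip(*tower_grads)]
train_op = opt.apply_gradients(avg_grads)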
So what is the exact command I should use? Thanks.