Distributed training on two machines fails: InvalidArgumentError

Time: 2016-05-12 05:28:24

Tags: tensorflow

I have two machines, each with 4 GPUs. I use

with tf.device('/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)):

to specify the device, but it fails with the following error log:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'init_all_tables': Could not satisfy explicit device specification '/job:worker/replica:1/task:4/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:ps/replica:0/task:0/cpu:0, 
/job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/gpu:0, /job:worker/replica:0/task:0/gpu:1, /job:worker/replica:0/task:0/gpu:2, /job:worker/replica:0/task:0/gpu:3, /job:worker/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/gpu:0, /job:worker/replica:0/task:1/gpu:1, /job:worker/replica:0/task:1/gpu:2, /job:worker/replica:0/task:1/gpu:3, /job:worker/replica:0/task:2/cpu:0, /job:worker/replica:0/task:2/gpu:0, /job:worker/replica:0/task:2/gpu:1, /job:worker/replica:0/task:2/gpu:2, /job:worker/replica:0/task:2/gpu:3, /job:worker/replica:0/task:4/cpu:0, /job:worker/replica:0/task:4/gpu:0, /job:worker/replica:0/task:4/gpu:1, /job:worker/replica:0/task:4/gpu:2, /job:worker/replica:0/task:4/gpu:3, /job:worker/replica:0/task:5/cpu:0, /job:worker/replica:0/task:5/gpu:0, /job:worker/replica:0/task:5/gpu:1, /job:worker/replica:0/task:5/gpu:2, /job:worker/replica:0/task:5/gpu:3, /job:worker/replica:0/task:6/cpu:0, /job:worker/replica:0/task:6/gpu:0, /job:worker/replica:0/task:6/gpu:1, /job:worker/replica:0/task:6/gpu:2, /job:worker/replica:0/task:6/gpu:3, /job:worker/replica:0/task:7/cpu:0, /job:worker/replica:0/task:7/gpu:0, /job:worker/replica:0/task:7/gpu:1, /job:worker/replica:0/task:7/gpu:2, /job:worker/replica:0/task:7/gpu:3
It seems TensorFlow cannot find machine B? But both machines have exactly the same hardware and software configuration.

Launch script:

# machine 10.10.12.28
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=0 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=1 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=2 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=3 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &


CUDA_VISIBLE_DEVICES='' ~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
-task_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

# machine 10.10.12.29

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=4 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=5 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=6 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=7 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &

1 answer:

Answer 0 (score: 3)

TL;DR: Do not use '/replica:%d' in your device specification.

The problem appears to be in your device string:

'/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)

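The open-source version of TensorFlow does not support '/replica:%d' in device specifications (the field is still accepted for backwards-compatibility reasons). The replica ID should be 0 for all tasks, which matches the error log: every available device is registered under /replica:0, so a request for /replica:1 cannot be satisfied. You can fix this immediately by passing --replica_id=0 for every task, but you should remove the flag from your version of the code.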
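As a minimal sketch of the longer-term fix, the device string can be built without the replica field at all. The helper name worker_device is hypothetical (it is not part of TensorFlow or the Inception code), and the sketch assumes the task and GPU indices come from the same flags used in the question:

```python
def worker_device(task_id, gpu_device_id):
    """Build a device string for one worker task.

    Hypothetical helper for illustration: it omits '/replica:%d' entirely,
    so the device is identified by job, task, and GPU only.
    """
    return '/job:worker/task:%d/gpu:%d' % (task_id, gpu_device_id)


# Task 4 with GPU 0 is the first worker launched on the second machine
# in the question's launch script.
print(worker_device(4, 0))
```

The resulting string would then be passed to tf.device(...) in place of the original format string, for example worker_device(FLAGS.task_id, FLAGS.gpu_device_id).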