我有两台机器,每台机器有4个GPU。我用
with tf.device('/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)):
指示设备,但失败了这些错误日志:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'init_all_tables': Could not satisfy explicit device specification '/job:worker/replica:1/task:4/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:ps/replica:0/task:0/cpu:0,
/job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/gpu:0, /job:worker/replica:0/task:0/gpu:1, /job:worker/replica:0/task:0/gpu:2, /job:worker/replica:0/task:0/gpu:3, /job:worker/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/gpu:0, /job:worker/replica:0/task:1/gpu:1, /job:worker/replica:0/task:1/gpu:2, /job:worker/replica:0/task:1/gpu:3, /job:worker/replica:0/task:2/cpu:0, /job:worker/replica:0/task:2/gpu:0, /job:worker/replica:0/task:2/gpu:1, /job:worker/replica:0/task:2/gpu:2, /job:worker/replica:0/task:2/gpu:3, /job:worker/replica:0/task:4/cpu:0, /job:worker/replica:0/task:4/gpu:0, /job:worker/replica:0/task:4/gpu:1, /job:worker/replica:0/task:4/gpu:2, /job:worker/replica:0/task:4/gpu:3, /job:worker/replica:0/task:5/cpu:0, /job:worker/replica:0/task:5/gpu:0, /job:worker/replica:0/task:5/gpu:1, /job:worker/replica:0/task:5/gpu:2, /job:worker/replica:0/task:5/gpu:3, /job:worker/replica:0/task:6/cpu:0, /job:worker/replica:0/task:6/gpu:0, /job:worker/replica:0/task:6/gpu:1, /job:worker/replica:0/task:6/gpu:2, /job:worker/replica:0/task:6/gpu:3, /job:worker/replica:0/task:7/cpu:0, /job:worker/replica:0/task:7/gpu:0, /job:worker/replica:0/task:7/gpu:1, /job:worker/replica:0/task:7/gpu:2, /job:worker/replica:0/task:7/gpu:3
似乎tensorflow无法找到机器B?但我在两台机器上都有完全相同的硬件和软件配置。
启动脚本:
# machine 10.10.12.28
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=0 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=1 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=2 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=3 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
CUDA_VISIBLE_DEVICES='' ~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
-task_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
# machine 10.10.12.29
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=4 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=5 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=6 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=7 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
答案 0 :(得分:3)
TL; DR:请勿在您的设备规范中使用'/replica:%d'
。
问题似乎出现在您的设备字符串中:
'/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)
TensorFlow的开源版本不支持设备规范'/replica:%d'
(但由于某些向后兼容性原因,它会被保留)。所有任务的副本ID应为0。您可以通过为每个任务传递0作为--replica_id
来立即解决此问题,但您应该从您的代码版本中删除该标志。