I want to try running TensorFlow in its distributed mode. As a test program I chose Google's Inception. When running with several workers, one of them always fails with an error. The other worker processes keep running but never produce any results.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
Due to my setup at the university I can only run TensorFlow 1.3.0. I have already tried different branches of the Inception repository (e.g. master, 1.5.0), but I always get this error.
Log of worker3, which produces the error (shortened for posting here):
WORKER CUDA Devices: 2
INFO:tensorflow:PS hosts are: ['tensorsrv1:12000']
INFO:tensorflow:Worker hosts are: ['tensorsrv1:12001', 'tensorsrv1:12002', 'tensorsrv1:12003', 'tensorsrv1:12004']
2018-06-20 12:15:11.189503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:84:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-06-20 12:15:11.189700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2018-06-20 12:15:11.189763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2018-06-20 12:15:11.189830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:84:00.0)
MPI Environment initialised. Process id: 3 Total processes: 5 || Hostname: tensorsrv1
D0620 12:15:11.507919397 18686 env_linux.c:77] Warning: insecure environment read function 'getenv' used
2018-06-20 12:15:11.516434: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tensorsrv1:12000}
2018-06-20 12:15:11.516569: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorsrv1:12001, 1 -> tensorsrv1:12002, 2 -> localhost:12003, 3 -> tensorsrv1:12004}
2018-06-20 12:15:11.516871: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:12003
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=4; total_num_replicas=4
INFO:tensorflow:2018-06-20 12:16:44.998266 Supervisor
2018-06-20 12:16:47.722197: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session c18a779f191ac2a4 with config: allow_soft_placement: true
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: None, ready: Variables not initialized: global_step, conv0/weights, conv0/BatchNorm/beta, conv0/BatchNorm/moving_mean, conv0/BatchNorm/moving_variance, [...]
[...]
2018-06-20 12:17:23.765818: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session bb98c90a9044da55 with config: allow_soft_placement: true
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Started 3 queues for processing input data.
INFO:tensorflow:Worker 2: 2018-06-20 12:17:48.049978: step 0, loss = 13.13(1.5 examples/sec; 21.359 sec/batch)
INFO:tensorflow:Worker 2: 2018-06-20 12:17:50.677927: step 0, loss = 13.09(12.2 examples/sec; 2.628 sec/batch)
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/shared/tensorflow//models/research/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 66, in <module>\n tf.app.run()', 'File "/sw/tensorflow/1.3.0-Python-3.5.2/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run\n _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/shared/tensorflow//models/research/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 62, in main\n inception_distributed_train.train(server.target, dataset, cluster_spec)', 'File "/shared/tensorflow//models/research/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_distributed_train.py", line 222, in train\n apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)', 'File "/sw/tensorflow/1.3.0-Python-3.5.2/lib/python3.5/site-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 257, in apply_gradients\n variables.global_variables())', 'File "/sw/tensorflow/1.3.0-Python-3.5.2/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 175, in wrapped\n return _add_should_use_warning(fn(*args, **kwargs))', 'File "/sw/tensorflow/1.3.0-Python-3.5.2/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 144, in _add_should_use_warning\n wrapped = TFShouldUseWarningWrapper(x)', 'File "/sw/tensorflow/1.3.0-Python-3.5.2/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 101, in __init__\n stack = [s.strip() for s in traceback.format_stack()]']
==================================
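For context, my reading of the traceback: SyncReplicasOptimizer.apply_gradients() internally calls report_uninitialized_variables(), which TensorFlow 1.x wraps so that a warning is logged if the resulting tensor is created but never consumed. The following is only my own minimal sketch reproducing that warning pattern, not code from the Inception scripts:

import tensorflow as tf

# Sketch of the warning pattern: tf.report_uninitialized_variables() is
# decorated so that TensorFlow complains if its result tensor is never
# consumed (run in a session or fed into another op).
v = tf.Variable(0, name="v")
unused = tf.report_uninitialized_variables()  # created but never used
# When `unused` is garbage-collected, TF logs:
#   "Object was never used ... call its mark_used() method"
# which is the same message as in the worker log above.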
The complete logs of every process can be found on pastebin, since they are too long for this post.
I start each worker with:
python bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir="${DATA_DIR}" \
--job_name="${JOB_NAME}" \
--task_id="${TASK_IDX}" \
--ps_hosts="${PS_STR}" \
--worker_hosts="${WORKER_STR}" \
--protocol="grpc+mpi" \
--max_steps=100
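For completeness, this is how I understand the ps/worker flags map onto the cluster definition; the values below are hypothetical and just mirror the hosts from the log above, using the generic tf.train.ClusterSpec / tf.train.Server API rather than the Inception scripts themselves:

import tensorflow as tf

# Hypothetical values mirroring --ps_hosts / --worker_hosts from the log above.
ps_hosts = ["tensorsrv1:12000"]
worker_hosts = ["tensorsrv1:12001", "tensorsrv1:12002",
                "tensorsrv1:12003", "tensorsrv1:12004"]

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
# Each process starts one server for its own (job_name, task_id) slot,
# matching the --job_name / --task_id / --protocol flags above.
server = tf.train.Server(cluster, job_name="worker", task_index=3,
                         protocol="grpc+mpi")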