Running:
python tf_cnn_benchmarks.py \
--local_parameter_device=gpu \
--num_gpus=1 \
--batch_size=2 \
--model=alexnet \
--variable_update=distributed_replicated \
--job_name=ps \
--ps_hosts=192.168.230.107:50000 \
--worker_hosts=192.168.230.107:60000,192.168.230.108:60000 \
--task_index=0 \
--server_protocol=grpc+verbs
fails with the following message:
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1258, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1248, in main
    bench = BenchmarkCNN()
  File "tf_cnn_benchmarks.py", line 525, in __init__
    protocol=FLAGS.server_protocol)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 145, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No server factory registered for the given ServerDef: cluster {
  job {
    name: "ps"
    tasks {
      key: 0
      value: "192.168.230.107:50000"
    }
  }
  job {
    name: "worker"
    tasks {
      key: 0
      value: "192.168.230.107:60000"
    }
    tasks {
      key: 1
      value: "192.168.230.108:60000"
    }
  }
}
job_name: "ps"
default_session_config {
  intra_op_parallelism_threads: 1
  gpu_options {
    force_gpu_compatible: true
  }
  allow_soft_placement: true
}
protocol: "grpc+verbs"
InfiniBand information:
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.40.7000
    node_guid: 248a:0703:00df:3ad0
    sys_image_guid: 248a:0703:00df:3ad3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id: MT_1090120019
    phys_port_cnt: 2
    Device ports:
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 5
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: InfiniBand
My TensorFlow version is tensorflow-gpu 1.4.1. Can TensorFlow use RDMA on an InfiniBand card?
Answer 0 (score: 1)
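Are you sure your TensorFlow build supports RDMA? The "No server factory registered" error means the installed binary has no server implementation for the requested protocol. To use the grpc+verbs protocol, you have to build TensorFlow from source yourself and enable RDMA (verbs) support during the configure step.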
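One way to check this before launching the full benchmark is to try constructing a server with the verbs protocol and see whether it raises the same error. The following is only a minimal sketch, assuming TensorFlow 1.x; the helper names `build_cluster` and `verbs_available` are made up for illustration and are not part of TensorFlow or tf_cnn_benchmarks. The hosts are the ones from the question.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host lists into a dict that tf.train.ClusterSpec accepts."""
    return {
        "ps": [h for h in ps_hosts.split(",") if h],
        "worker": [h for h in worker_hosts.split(",") if h],
    }


def verbs_available(cluster, job_name="ps", task_index=0):
    """Return True if this TensorFlow build can create a grpc+verbs server.

    Hypothetical helper, assuming TensorFlow 1.x. TensorFlow is imported
    locally so that build_cluster can be used without it installed.
    """
    import tensorflow as tf

    try:
        # start=False creates the server without starting it; creation may
        # still reserve the port listed for this task.
        tf.train.Server(
            tf.train.ClusterSpec(cluster),
            job_name=job_name,
            task_index=task_index,
            protocol="grpc+verbs",
            start=False,
        )
    except tf.errors.NotFoundError:
        # "No server factory registered": the build has no verbs support.
        return False
    return True


# Same cluster layout as in the question.
cluster = build_cluster(
    "192.168.230.107:50000",
    "192.168.230.107:60000,192.168.230.108:60000",
)
```

Calling `verbs_available(cluster)` on the ps host should return False with a stock tensorflow-gpu 1.4.1 install if the diagnosis above is right; it should be run on the machine that owns the chosen task address, with no other process using that port.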