TensorFlow cannot use RDMA over an InfiniBand network

Date: 2018-03-22 11:25:50

Tags: tensorflow infiniband rdma

Command executed:

python tf_cnn_benchmarks.py \
    --local_parameter_device=gpu \
    --num_gpus=1 \
    --batch_size=2 \
    --model=alexnet \
    --variable_update=distributed_replicated \
    --job_name=ps \
    --ps_hosts=192.168.230.107:50000 \
    --worker_hosts=192.168.230.107:60000,192.168.230.108:60000 \
    --task_index=0 \
    --server_protocol=grpc+verbs

It fails with the following message:

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1258, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1248, in main
    bench = BenchmarkCNN()
  File "tf_cnn_benchmarks.py", line 525, in __init__
    protocol=FLAGS.server_protocol)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 145, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No server factory registered for the given ServerDef: cluster {
  job {
    name: "ps"
    tasks {
      key: 0
      value: "192.168.230.107:50000"
    }
  }
  job {
    name: "worker"
    tasks {
      key: 0
      value: "192.168.230.107:60000"
    }
    tasks {
      key: 1
      value: "192.168.230.108:60000"
    }
  }
}
job_name: "ps"
default_session_config {
  intra_op_parallelism_threads: 1
  gpu_options {
    force_gpu_compatible: true
  }
  allow_soft_placement: true
}
protocol: "grpc+verbs"

InfiniBand information:

hca_id: mlx4_0
    transport:                      InfiniBand (0)
    fw_ver:                         2.40.7000
    node_guid:                      248a:0703:00df:3ad0
    sys_image_guid:                 248a:0703:00df:3ad3
    vendor_id:                      0x02c9
    vendor_part_id:                 4099
    hw_ver:                         0x1
    board_id:                       MT_1090120019
    phys_port_cnt:                  2
    Device ports:
            port:   1
                    state:                  PORT_ACTIVE (4)
                    max_mtu:                4096 (5)
                    active_mtu:             4096 (5)
                    sm_lid:                 1
                    port_lid:               5
                    port_lmc:               0x00
                    link_layer:             InfiniBand

            port:   2
                    state:                  PORT_DOWN (1)
                    max_mtu:                4096 (5)
                    active_mtu:             4096 (5)
                    sm_lid:                 0
                    port_lid:               0
                    port_lmc:               0x00
                    link_layer:             InfiniBand

My TensorFlow version is tensorflow-gpu 1.4.1. Can TensorFlow use RDMA on an InfiniBand card?

1 answer:

Answer 0 (score: 1)

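Are you sure your TensorFlow build supports RDMA? The "No server factory registered" error means the installed binary has no server implementation for the requested protocol. To use the grpc+verbs protocol, you have to build TensorFlow from source yourself and enable RDMA (verbs) support during the configure step.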
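One way to check whether a given TensorFlow install was built with verbs support, before launching the whole benchmark, is to try constructing a server with that protocol and look for the same "No server factory registered" error shown in the traceback. The following is only a rough sketch, assuming the TF 1.x `tf.train.Server` API. The function name `verbs_supported` and the `make_server` hook are made up for illustration and are not part of TensorFlow, and the use of `localhost:0` to let gRPC pick a free port is an assumption.

```python
def verbs_supported(make_server=None, protocol="grpc+verbs"):
    """Return True if a server can be constructed for `protocol`.

    `make_server` is a hypothetical hook (not a TensorFlow API): a callable
    that takes the protocol string and tries to build a server. It exists
    so the check can be exercised without TensorFlow installed.
    """
    if make_server is None:
        import tensorflow as tf  # assumes the TF 1.x API (tf.train.Server)

        def make_server(proto):
            # Throwaway single-task cluster; port 0 is assumed to let
            # gRPC choose any free port.
            cluster = tf.train.ClusterSpec({"probe": ["localhost:0"]})
            return tf.train.Server(cluster, job_name="probe", task_index=0,
                                   protocol=proto, start=False)

    try:
        make_server(protocol)
    except Exception as err:
        # Same message as in the traceback from the question.
        if "No server factory registered" in str(err):
            return False
        raise  # anything else is a different problem; do not hide it
    return True
```

Calling `verbs_supported()` in the same Python environment that runs tf_cnn_benchmarks.py would presumably return False for the install in the question, judging from its traceback. After a from-source build with verbs enabled it should return True, provided the RDMA device can be opened.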