mxnet launch.py error: Check failed: (my_node_.port) != (-1) bind failed

Asked: 2018-11-28 18:12:18

Tags: python distributed-system mxnet

mxnet version 1.3.0

I am trying to run [train_mnist.py][1] on 2 nodes of a cluster, one with 3 GPUs (nodeA) and one with 1 GPU (nodeB). Following this [blog post][2], I am using the [launch.py][3] script provided in the mxnet repository to do this.

I have set up a hosts file containing the two hostnames:

    username@nodeA
    username@nodeB

I then try to launch distributed training like this:

    python3 launch.py -n 4 -H hosts "/opt/python/current/bin/python3 train_mnist.py  
           --network mlp --lr-factor .9 --lr .01 --kv-store dist_sync"

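(Note that I did modify fit.py so that the script detects how many GPUs its local host has and sets the context accordingly, and I have verified that this works on each node individually.)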
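
For illustration only, here is a minimal sketch of one way the per-host GPU detection could be done. It is not the actual change made to fit.py. It assumes `nvidia-smi` is on the PATH of each node and that its `-L` listing prints one `GPU <index>: ...` line per device; the function names are hypothetical. The resulting string has the comma-separated form that the example scripts' `--gpus` option takes.

```python
import re
import subprocess


def gpu_ids_from_listing(listing):
    """Return a comma-separated id string such as "0,1,2" from `nvidia-smi -L` text.

    Assumption: each device appears on its own line starting with "GPU <index>:".
    """
    ids = re.findall(r"^GPU (\d+):", listing, flags=re.MULTILINE)
    return ",".join(ids)


def local_gpu_ids():
    """Run `nvidia-smi -L` on this host; return "" if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,  # text mode, works on Python 3.6
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return ""
    return gpu_ids_from_listing(result.stdout)
```

An empty string means no GPU was found, in which case the script would have to fall back to a CPU context.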

I get this error:

    Traceback (most recent call last):
      File "train_mnist.py", line 28, in <module>
        from common import find_mxnet, fit
      File "/home/username/username/common/find_mxnet.py", line 20, in <module>
        import mxnet as mx
      File "/opt/python/current/lib/python3.6/site-packages/mxnet/__init__.py", line 57, in <module>
        from . import kvstore_server
      File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 85, in <module>
        _init_kvstore_server_module()
      File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
        server.run()
      File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 73, in run
        check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
      File "/opt/python/current/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
        raise MXNetError(py_str(_LIB.MXGetLastError()))
    mxnet.base.MXNetError: [10:44:47] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

It continues with this stack trace:

    Stack trace returned 10 entries:
    [bt] (0) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3617ba) [0x7f2cb77d37ba]
    [bt] (1) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x361dd1) [0x7f2cb77d3dd1]
    [bt] (2) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x312b3fa) [0x7f2cba59d3fa]
    [bt] (3) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31352fa) [0x7f2cba5a72fa]
    [bt] (4) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31265e9) [0x7f2cba5985e9]
    [bt] (5) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2be69d3) [0x7f2cba0589d3]
    [bt] (6) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88) [0x7f2cb9e4b7f8]
    [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f2d12c9ddae]
    [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f2d12c9d71f]
    [bt] (9) /opt/python/current/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x29f) [0x7f2d12eb24af]
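
The `bind failed` text suggests that a process on the node could not bind a network port. As a rough way to test that independently of mxnet, the sketch below checks whether a TCP socket can be bound to a given address and port using only the standard library. It does not reproduce how mxnet chooses its address or port; the function name and the values passed to it are placeholders.

```python
import socket


def can_bind(host, port):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Port 0 asks the operating system for any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use", "cannot assign requested
            # address", permission errors, and unresolvable host names.
            return False
        return True
```

Running something like `can_bind("nodeA", 0)` on nodeA itself would show whether the hostname resolves to an address the machine can bind at all; a specific port number can be passed to test that port.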

So far I have tried:

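  • Reducing my hosts file to contain only one host. I tried both options (so running only nodeA or only nodeB) but got the same error
  • Varying the number of workers (though I assume this should reflect the number of available GPUs)
  • Searching for my_node_ in the last two files mentioned in the traceback, /opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py and /opt/python/current/lib/python3.6/site-packages/mxnet/base.py. Neither file appears to contain a reference to that variable.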
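
The error message names `src/van.cc:291` and the stack trace runs through libmxnet.so, so the check may live in the compiled library, not in the Python files. A minimal sketch for widening the search to every file under a directory, binaries included, is below. The function name is hypothetical, and whether the string shows up in libmxnet.so has not been confirmed.

```python
import os


def files_containing(root, needle, chunk_size=1 << 20):
    """Yield paths under root whose raw bytes contain needle (a bytes object)."""
    if not needle:
        raise ValueError("needle must be a non-empty bytes object")
    overlap = len(needle) - 1
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    tail = b""
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        # Keep the end of the previous chunk so a match that
                        # straddles a chunk boundary is still found.
                        window = tail + chunk
                        if needle in window:
                            yield path
                            break
                        tail = window[-overlap:] if overlap else b""
            except OSError:
                # Unreadable files are skipped.
                continue
```

For example, `list(files_containing("/opt/python/current/lib/python3.6/site-packages/mxnet", b"my_node_"))` would list every file in the installed package that contains the name.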

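I'm not sure where to go from here. I was under the impression that distributed training should work seamlessly with mxnet, so I'm hoping this has a simple fix. Suggestions on what to check and/or how to debug this are welcome. Thanks.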

  [1]: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_mnist.py
  [2]: https://tsmatz.wordpress.com/2017/02/22/mxnetr-gpu-acceleration-distributed-training-active-learning/
  [3]: https://github.com/apache/incubator-mxnet/blob/master/tools/launch.py

0 answers:

No answers