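I am using `mxnet` version 1.3.0.
I'm trying to run [train_mnist.py][1] on 2 nodes of a cluster, one with 3 GPUs (nodeA) and one with 1 GPU (nodeB). Following this [blog post][2], I'm using the [launch.py][3] script provided in the `mxnet` repository to do this.
I've set up a `hosts` file containing the two hostnames: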
username@nodeA
username@nodeB
Then I try to launch distributed training like this:
python3 launch.py -n 4 -H hosts "/opt/python/current/bin/python3 train_mnist.py
--network mlp --lr-factor .9 --lr .01 --kv-store dist_sync"
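(Note that I did modify fit.py so that the script detects how many GPUs are on its local host and sets `context` accordingly, and I have verified that this works on each node separately.)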
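For readers who want to reproduce the per-host GPU detection, here is a minimal sketch of one way it could be done. It is not the actual change made to fit.py: the helper names (`count_gpus`, `local_gpu_listing`, `gpus_argument`) are hypothetical, and it assumes `nvidia-smi -L` is on the PATH and prints one `GPU n: ...` line per device. The result is a comma-separated device list such as `0,1,2`, which is the form the example's `--gpus` option appears to accept.

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in the text printed by `nvidia-smi -L` (one 'GPU n: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def local_gpu_listing() -> str:
    """Return the output of `nvidia-smi -L`, or '' if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "-L"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return result.stdout if result.returncode == 0 else ""


def gpus_argument(num_gpus: int) -> str:
    """Build a comma-separated device list such as '0,1,2'; empty means CPU only."""
    return ",".join(str(i) for i in range(num_gpus))


if __name__ == "__main__":
    # Hypothetical usage: print the device list for the host this runs on.
    print(gpus_argument(count_gpus(local_gpu_listing())))
```

Parsing is kept separate from the subprocess call so that the counting logic can be checked on a machine without any GPUs.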
I get this error:
Traceback (most recent call last):
File "train_mnist.py", line 28, in <module>
from common import find_mxnet, fit
File "/home/username/username/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/opt/python/current/lib/python3.6/site-packages/mxnet/__init__.py", line 57, in <module>
from . import kvstore_server
File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/opt/python/current/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:44:47] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
This continues with the stack trace:
Stack trace returned 10 entries:
[bt] (0) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3617ba) [0x7f2cb77d37ba]
[bt] (1) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x361dd1) [0x7f2cb77d3dd1]
[bt] (2) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x312b3fa) [0x7f2cba59d3fa]
[bt] (3) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31352fa) [0x7f2cba5a72fa]
[bt] (4) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31265e9) [0x7f2cba5985e9]
[bt] (5) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2be69d3) [0x7f2cba0589d3]
[bt] (6) /opt/python/current/lib/python3.6/site-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88) [0x7f2cb9e4b7f8]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f2d12c9ddae]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f2d12c9d71f]
[bt] (9) /opt/python/current/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x29f) [0x7f2d12eb24af]
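So far, I have tried:
- Running on a single node (only nodeA or only nodeB), but I get the same error.
- Looking for `my_node_` in /opt/python/current/lib/python3.6/site-packages/mxnet/base.py and /opt/python/current/lib/python3.6/site-packages/mxnet/kvstore_server.py. Neither file seems to contain a reference to that variable.

I'm not sure where to go from here. I was under the impression that distributed training should work seamlessly with `mxnet`, so I'm hoping this is a simple fix. Suggestions on what to check and/or how to debug this are welcome. Thanks.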
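One check that might help narrow this down: the message points at src/van.cc and the stack trace runs through libmxnet.so, so the failing check is in native code, and "bind failed" suggests that a process could not bind a socket. The sketch below tests whether a given address and port can be bound on the machine it runs on. It assumes the launcher passes the scheduler address through the `DMLC_PS_ROOT_URI` and `DMLC_PS_ROOT_PORT` environment variables; the fallback values `127.0.0.1` and `9091`, and the helper name `can_bind`, are placeholders for illustration only.

```python
import os
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Covers address-in-use, an address that is not local, and name resolution errors.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Assumed variable names; the defaults below are hypothetical placeholders.
    host = os.environ.get("DMLC_PS_ROOT_URI", "127.0.0.1")
    port = int(os.environ.get("DMLC_PS_ROOT_PORT", "9091"))
    print(host, port, "bindable" if can_bind(host, port) else "bind failed")
```

Running this on each node with the address the launcher uses would show whether the address belongs to a local interface and whether the port is already taken.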
[1]: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_mnist.py
[2]: https://tsmatz.wordpress.com/2017/02/22/mxnetr-gpu-acceleration-distributed-training-active-learning/
[3]: https://github.com/apache/incubator-mxnet/blob/master/tools/launch.py