I implemented distributed neural-network training code in PyTorch using the OpenMPI backend. When I run this code on my local machine with a single GPU (spawning 4 processes), it works fine.
But when I try to run the same code on a server with 2 GPUs (again spawning 4 processes, 2 per GPU), I get the following error:
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: g2-nasp
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
Start node: 1 Total: 3
Start node: 2 Total: 3
Start node: 0 Total: 3
[1557057502.671443] [g2-nasp:29190:0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 131072, error message Bad address
[g2-nasp:29185] 11 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[g2-nasp:29185] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The error seems to be triggered by the lines where I communicate via PyTorch's distributed package. If I comment out the communication lines, the training processes appear to run independently, without any synchronization. The synchronization code is as follows:
def broadcast(data, rank, world_size, recv_buff_l, recv_buff_r):
    # Neighbours on a ring of world_size processes
    left = ((rank - 1) + world_size) % world_size
    right = (rank + 1) % world_size
    # Exchange with the right neighbour first
    send_req_l = dist.isend(data, dst=left)
    recv_req_r = dist.irecv(recv_buff_r, src=right)
    recv_req_r.wait()
    send_req_l.wait()
    # Then exchange with the left neighbour
    send_req_r = dist.isend(data, dst=right)
    recv_req_l = dist.irecv(recv_buff_l, src=left)
    recv_req_l.wait()
    send_req_r.wait()
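For what it's worth, the ring-neighbour index arithmetic used above can be checked in isolation, independently of torch.distributed (this is just a sketch of the same computation, not part of my training code):

```python
def ring_neighbors(rank, world_size):
    # Same index arithmetic as in broadcast(): the left and right
    # neighbours of `rank` on a ring of `world_size` processes.
    left = ((rank - 1) + world_size) % world_size
    right = (rank + 1) % world_size
    return left, right

# With 3 processes (as in the "Start node" log lines above),
# rank 0's neighbours are ranks 2 (left) and 1 (right).
print(ring_neighbors(0, 3))  # -> (2, 1)
```

So each rank sends to both neighbours and receives from both, which should be deadlock-free as long as every rank calls `broadcast` in the same order.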
I'm not sure what is causing this error. In case it's relevant, these are the environment setups I'm using:
Local Machine:
Ubuntu 18.04
CUDA 10.1
CuDNN 7.5
OpenMPI 4.0.1
GPU: NVIDIA GTX 960M
Server Machine:
Ubuntu 16.04
CUDA 10.1
CuDNN 7.5
OpenMPI 3.1.4
GPU: NVIDIA Tesla K40c [2 GPUs]
Can anyone suggest what I'm doing wrong? Why does the synchronization not seem to work with multiple GPUs?
Update 01: After some modifications, I tried running the code on a single GPU of the server, and the same error occurred during communication. Any ideas?