Bad address error in PyTorch distributed deployment

Time: 2019-05-05 12:05:10

Tags: python pytorch openmpi

I implemented distributed neural network code in PyTorch using the OpenMPI backend. When I run this code on my local machine with a single GPU (spawning 4 processes), it works fine.
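For context, each process initializes the process group with the MPI backend roughly as follows (a minimal sketch; the exact rank-to-GPU mapping in my actual script may differ slightly):

import torch
import torch.distributed as dist

def init_process():
    # With the MPI backend, rank and world size are supplied by the mpirun
    # launcher, so no init_method, rank, or world_size arguments are needed here.
    dist.init_process_group(backend='mpi')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Round-robin each process onto one of the visible GPUs.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    return rank, world_size

The processes are launched with something like mpirun -np 4 python train.py (the script name is illustrative).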

But when I try to run the same code on a server with 2 GPUs (again spawning 4 processes, 2 per GPU), I get the following error:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           g2-nasp
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
Start node: 1  Total:   3
Start node: 2  Total:   3
Start node: 0  Total:   3
[1557057502.671443] [g2-nasp:29190:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 131072, error message Bad address
[g2-nasp:29185] 11 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[g2-nasp:29185] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The error seems to occur on the lines where I communicate using PyTorch's distributed package. If I comment out the communication lines, the training processes appear to run independently, without any synchronization. The synchronization code is as follows:

import torch.distributed as dist

def broadcast(data, rank, world_size, recv_buff_l, recv_buff_r):
    # Ring-style exchange: send `data` to the left neighbour while receiving
    # from the right, then repeat in the opposite direction.
    left = ((rank - 1) + world_size) % world_size
    right = (rank + 1) % world_size

    send_req_l = dist.isend(data, dst=left)
    recv_req_r = dist.irecv(recv_buff_r, src=right)
    recv_req_r.wait()
    send_req_l.wait()

    send_req_r = dist.isend(data, dst=right)
    recv_req_l = dist.irecv(recv_buff_l, src=left)
    recv_req_l.wait()
    send_req_r.wait()
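The receive buffers are plain tensors with the same shape as data; the call site looks roughly like this (simplified from my training loop, names are illustrative):

# Illustrative call site: exchange the local tensor with both ring neighbours.
recv_buff_l = torch.zeros_like(data)
recv_buff_r = torch.zeros_like(data)
broadcast(data, dist.get_rank(), dist.get_world_size(), recv_buff_l, recv_buff_r)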

I am not sure what is causing this error. In case it helps, I am using the following environment setup:

Local Machine:
    Ubuntu 18.04
    CUDA 10.1
    CuDNN 7.5
    OpenMPI 4.0.1
    GPU: NVIDIA GTX 960M

Server Machine:
    Ubuntu 16.04
    CUDA 10.1
    CuDNN 7.5
    OpenMPI 3.1.4
    GPU: NVIDIA Tesla K40c [2 GPUs]

Can anyone suggest what I am doing wrong? Why does the synchronization not seem to work with multiple GPUs?

Update 01: After some modifications, I tried running the code on a single GPU of the server, and the same error occurred at the communication step. Any ideas?

0 Answers

There are no answers yet.