Question

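In my application I use an InfiniBand infrastructure to send a stream of data from one server to another. To ease development I have been using IP over InfiniBand, because I am more familiar with socket programming. Until now the performance (max bw) was good enough for me (I knew I was not reaching the maximum achievable bandwidth), but now I need to get more bandwidth out of that InfiniBand connection.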

ib_write_bw claims that my maximum achievable bandwidth is around 1500 MB/s (I am not getting 3000 MB/s because my card is installed in a PCI 2.0 8x slot).

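So far so good. I coded my communication channel using ibverbs and rdma, but I am getting far less bandwidth than I should be able to get. I am even getting a bit less than with sockets, although at least my application does not use any CPU power: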

ib_write_bw: 1500 MB/s

sockets: 700 MB/s <= one core of my system is at 100% during this test

ibverbs + rdma: 600 MB/s <= no CPU is used at all during this test

The bottleneck seems to be here:

ibv_sge sge;
sge.addr = (uintptr_t)memory_to_transfer;
sge.length = memory_to_transfer_size;
sge.lkey = memory_to_transfer_mr->lkey;

ibv_send_wr wr;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
  notifyError("Unable to ibv post send");
}

The code that follows, which waits for the completion, is:

// Wait for completion
ibv_cq *cq;
void* cq_context;
if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
  notifyError("Unable to get a ibv cq event");
}

ibv_ack_cq_events(cq, 1);

if (ibv_req_notify_cq(cq, 0) != 0) {
  notifyError("Unable to get a req notify");
}

ibv_wc wc;
int myRet = ibv_poll_cq(cq, 1, &wc);
if (myRet > 1) {
  LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
}

The time from my ibv_post_send until ibv_get_cq_event returns an event is 13.3 ms when transferring chunks of 8 MB, which works out to around 600 MB/s.

To be more specific, here is my overall operation in pseudocode:

Active side:

post a message receive
rdma connection
wait for rdma connection event
<<at this point transfer tx flow starts>>
start:
register memory containing bytes to transfer
wait remote memory region addr/key ( I wait for a ibv_wc)
send data with ibv_post_send
post a message receive
wait for ibv_post_send event ( I wait for a ibv_wc) (this lasts 13.3 ms)
send message "DONE"
unregister memory 
goto start

Passive side:

post a message receive
rdma accept
wait for rdma connection event
<<at this point transfer rx flow starts>>
start:
register memory that has to receive the bytes
send addr/key of memory registered
wait "DONE" message 
unregister memory
post a message receive
goto start

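Does anyone know what I am doing wrong, or what I can improve? I am not affected by "Not Invented Here" syndrome, so I am even open to throwing away what I have done so far and adopting something else. I only need a point-to-point contiguous transfer.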

Answer 1

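Based on your pseudocode, it looks as if you register and deregister a memory region for every transfer. I think that is probably the main reason things are slow: memory registration is a pretty expensive operation, so you want to do it as rarely as possible and reuse your memory region as much as possible. All the time spent registering memory is time you are not spending on transferring data.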
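As a rough illustration of "register once, reuse many times", here is a minimal sketch. `RegisteredBuffer` and its callback types are hypothetical names introduced only for this example; the callbacks stand in for `ibv_reg_mr` and `ibv_dereg_mr`, so the sketch compiles without libibverbs or an HCA.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper (not part of libibverbs): registers a buffer once in
// the constructor and deregisters it once in the destructor, so that every
// transfer in between reuses the same registration handle.
template <typename Handle>
class RegisteredBuffer {
public:
    using RegisterFn = std::function<Handle(void*, std::size_t)>;
    using DeregisterFn = std::function<void(Handle)>;

    RegisteredBuffer(void* addr, std::size_t len,
                     const RegisterFn& reg, DeregisterFn dereg)
        : addr_(addr), len_(len), dereg_(std::move(dereg)) {
        assert(addr != nullptr && len > 0);
        handle_ = reg(addr, len);  // the single, expensive registration
    }

    ~RegisteredBuffer() { dereg_(handle_); }  // the single deregistration

    RegisteredBuffer(const RegisteredBuffer&) = delete;
    RegisteredBuffer& operator=(const RegisteredBuffer&) = delete;

    void* data() const { return addr_; }
    std::size_t size() const { return len_; }
    Handle handle() const { return handle_; }

private:
    void* addr_;
    std::size_t len_;
    DeregisterFn dereg_;
    Handle handle_{};
};
```

In a real ibverbs program, `Handle` would presumably be `ibv_mr*`, the register callback would wrap `ibv_reg_mr(pd, addr, len, access_flags)` and the deregister callback would wrap `ibv_dereg_mr(mr)`. The object would be created once after the connection is established and kept alive across all iterations of the transfer loop, with each block copied or produced into `data()` before posting.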

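Your pseudocode has a second problem: you wait synchronously for each completion and do not post another work request until the previous one has completed. That means the HCA sits idle from the moment a work request finishes until you have picked up the completion and posted the next request. You are much better off keeping multiple send/receive work requests in flight, so that when the HCA finishes one work request it can move on to the next one immediately.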
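The idea of keeping several work requests in flight can be sketched as a sliding window. `runPipelined`, `post` and `reap` are hypothetical names for this example only; `post` stands in for a wrapper around `ibv_post_send` and `reap` for a wrapper around `ibv_poll_cq`, so the loop can be tried without any InfiniBand hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical sliding-window loop (not part of libibverbs).
//   post(i): issue work request number i.
//   reap():  return how many completions were collected (may be 0 when
//            polling finds nothing yet).
// Keeps up to `depth` requests outstanding and returns the peak number of
// requests that were in flight at once.
inline std::size_t runPipelined(std::size_t total, std::size_t depth,
                                const std::function<void(std::size_t)>& post,
                                const std::function<std::size_t()>& reap) {
    assert(depth > 0);
    std::size_t posted = 0;
    std::size_t completed = 0;
    std::size_t peakInFlight = 0;

    while (completed < total) {
        // Refill the window before waiting, so the HCA always has work queued.
        while (posted < total && posted - completed < depth) {
            post(posted);
            ++posted;
        }
        if (posted - completed > peakInFlight) {
            peakInFlight = posted - completed;
        }
        const std::size_t n = reap();
        assert(n <= posted - completed);  // cannot complete more than was posted
        completed += n;
    }
    return peakInFlight;
}
```

With real verbs, each outstanding request needs its own buffer (or its own slice of a registered region) that is not touched until its completion arrives, and the queue pair has to be created with a send queue deep enough (`max_send_wr`) for the chosen `depth`. Splitting the 8 MB block into several smaller RDMA writes is one way to get a window to fill.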

Answer 2

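I solved the problem by allocating my buffers aligned to the page size. On my system the page size is 4K (the value returned by sysconf(_SC_PAGESIZE)). With that change I can now reach about 1400 MB/s, even though I still do the register/unregister for every transfer.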
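For reference, one way to get a page-aligned buffer on a POSIX system is `posix_memalign` combined with `sysconf(_SC_PAGESIZE)`. This is only a minimal sketch; `allocPageAligned` and `isPageAligned` are hypothetical helper names, and the answer does not say which allocation call was actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper: allocate `size` bytes starting on a page boundary.
// Returns nullptr on failure. Release the memory with free().
inline void* allocPageAligned(std::size_t size) {
    assert(size > 0);
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return nullptr;
    }
    void* p = nullptr;
    if (posix_memalign(&p, static_cast<std::size_t>(page), size) != 0) {
        return nullptr;
    }
    return p;
}

// Hypothetical helper: true if `p` lies on a page boundary.
inline bool isPageAligned(const void* p) {
    const long page = sysconf(_SC_PAGESIZE);
    return page > 0 &&
           reinterpret_cast<std::uintptr_t>(p) %
               static_cast<std::uintptr_t>(page) == 0;
}
```

A buffer obtained this way would then be passed to the memory registration call in place of one from plain `malloc` or `new`.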

infiniband rdma poor transfer bw
