Question

We are running a small cluster in which Intel Xeon nodes are connected via Infiniband. The login node is not attached to the Infiniband interconnect. All nodes run Debian Jessie.

We run Slurm 14.03.9 on the login node. Because the system OpenMPI is outdated and does not support the MPI3 interface (which I need), I compiled a custom OpenMPI 2.0.1.

When I start the MPI job manually via

mpirun --hostfile hosts -np xx program_name,

it runs fine, also across multiple nodes, and makes full use of Infiniband. Good.

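However, when I invoke my MPI application from inside a Slurm script, it crashes with strange segfaults. I compiled OpenMPI with Slurm support, and PMI also seems to work, so I can simply write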

mpirun program_name

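in the Slurm script, and it automatically dispatches the job to the right nodes with the right number of CPU cores. However, I keep getting these segfaults.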

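Explicitly passing "-np" and "--hostfile" to mpirun in the Slurm script does not help either. The exact same command that runs fine when started manually causes segfaults when started inside the Slurm environment.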

Before the segfault occurs, I get the following error message from OpenMPI:

--------------------------------------------------------------------------
Failed to create a completion queue (CQ):

Hostname: xxxx
Requested CQE: 16384
Error:    Cannot allocate memory

Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: xxxx
--------------------------------------------------------------------------

I searched on Google but did not find much useful information. I suspected a locked-memory limit, but running "ulimit -l" on the compute nodes returns "unlimited", as it should.

Any help getting my jobs to run with OpenMPI in the Slurm environment is appreciated.

Answer 1

I was finally able to solve the problem.

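The segfaults were indeed related to the error message posted above. The cause was a "max locked memory" limit on the compute nodes to which Slurm dispatches the jobs.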
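One way to see which locked-memory limit a process is really running with is to read its /proc/<pid>/limits entry. The following is only a minimal sketch, assuming a Linux node; the function name show_memlock_limit is made up for illustration and is not part of Slurm or OpenMPI.

```shell
# Print the "Max locked memory" row from /proc/<pid>/limits (Linux only).
# With no argument, inspect the current shell itself.
show_memlock_limit() {
    pid="${1:-$$}"
    grep 'Max locked memory' "/proc/${pid}/limits"
}

# Limit of the current shell. Placed inside a Slurm batch script, this
# reports the limit the job is actually running with, which may differ
# from what an interactive login shell on the same node reports.
show_memlock_limit
```

On a compute node, something like show_memlock_limit "$(pidof slurmd)" should show the limit of the daemon itself, assuming a single slurmd process; reading another user's limits file may require root.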

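I struggled for a long time to lift this locked-memory limit. None of the standard procedures found via Google worked (neither editing /etc/security/limits.conf nor editing /etc/init.d/slurmd). The reason is that my Debian Jessie nodes use systemd, which does not honor these files. I had to add the lines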

[Service]
LimitMEMLOCK=32768000000

to the file /etc/systemd/system/multi-user.target.wants/slurmd.service on all my nodes. It did not work with unlimited, so I had to use the total system RAM in bytes. After modifying this file, I executed

systemctl daemon-reload
systemctl restart slurmd

on all nodes, and the problem finally went away. Thanks to Carles Fenoy for the valuable comments!
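The byte value for LimitMEMLOCK can be derived from /proc/meminfo instead of being typed by hand. This is a rough sketch, assuming a Linux node; total_ram_bytes is a hypothetical helper name, and MemTotal is usually slightly below the installed RAM, so the result will not match a round figure exactly.

```shell
# Convert MemTotal from /proc/meminfo (reported in kB) to bytes.
total_ram_bytes() {
    kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $((kb * 1024))
}

# Print the two lines in the form used in slurmd.service above.
printf '[Service]\nLimitMEMLOCK=%s\n' "$(total_ram_bytes)"
```

After the restart, systemctl show slurmd -p LimitMEMLOCK should report the value systemd applied to the unit. Note also that the keyword systemd documents for "no limit" is infinity rather than unlimited; whether that would have worked on these nodes was not tested here.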
