Question

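I am using the MPI.NET library, and I recently moved my application to a larger cluster (more compute nodes). I have started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job completes; the rest of the time it hangs. I have seen it happen with Scatter, Broadcast, and Barrier.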

I have put an MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application and created trace logs (using the MPIEXEC.exe /trace switch).

C# snippet:

static void Main(string[] args)
{
    var hostName = System.Environment.MachineName;
    Logger.Trace($"Program.Main entered on {hostName}");
    string[] mpiArgs = null;
    MPI.Environment myEnvironment = null;
    try
    {
        Logger.Trace($"Trying to instantiate MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
        myEnvironment = new MPI.Environment(ref mpiArgs);
        Logger.Trace($"Is currently initialized? {MPI.Environment.Initialized}. {hostName} is waiting at Barrier... ");
        Communicator.world.Barrier(); // CODE HANGS HERE!
        Logger.Trace($"{hostName} is past Barrier");
    }
    catch (Exception envEx)
    {
        Logger.Error(envEx, "Could not instantiate MPI.Environment object");
    }

    // rest of implementation here...

}

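I can see in the log that the MPI_Barrier function of msmpi.dll is called, and I can see messages being sent and received afterwards, in both the passing and the failing runs. In the passing runs, the messages are sent and received, and then the MPI_Barrier function's Leave is logged.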

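In the failing runs, it looks like one (or more) of the sent messages is lost - it is never received by the target. Am I right in thinking that a message lost inside the MPI_Barrier call means the processes never synchronize, and therefore all of them stay stuck at the MPI_Barrier call?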
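To narrow down which hosts are actually stuck in the barrier, one option is to wrap the blocking call in a small watchdog that keeps logging while the call has not returned. The following is only a minimal sketch: `BlockingCallWatchdog` and its `Run` method are hypothetical names, not part of MPI.NET or MS-MPI, and the timer thread deliberately makes no MPI calls.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of MPI.NET): runs a blocking call on the
// calling thread and logs a message at a fixed interval until it returns.
static class BlockingCallWatchdog
{
    public static void Run(Action blockingCall, TimeSpan warnEvery, Action<string> log)
    {
        var started = DateTime.UtcNow;

        // The timer callback only logs elapsed time; it never touches MPI,
        // so all MPI calls stay on the thread that invoked Run.
        using (var timer = new Timer(
            _ => log($"Still waiting after {(DateTime.UtcNow - started).TotalSeconds:F0}s"),
            null, warnEvery, warnEvery))
        {
            // Blocks here; if the call never returns, the timer keeps logging.
            blockingCall();
        }
        // Note: a callback that was already queued may still fire once
        // shortly after the blocking call has returned.
    }
}
```

In the snippet above, the barrier line could then be written as `BlockingCallWatchdog.Run(() => Communicator.world.Barrier(), TimeSpan.FromSeconds(30), msg => Logger.Trace($"{hostName}: {msg}"));` where the 30-second interval is an arbitrary choice. This does not fix the hang; it only makes it visible in the application log which hosts entered the barrier and never left.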

What could cause this to happen intermittently? Could poor network performance between the compute nodes be a cause?

I am running MS HPC Pack 2008 R2, so the version of MS-MPI is quite old: v2.0.

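EDIT - Additional information: If I keep a task running within a single node, the problem does not occur. For example, if I run a task using 8 cores on one node it is fine, but if I use 9 cores across two nodes I see the problem about 50% of the time.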

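Also, we have two clusters in use, and this only happens on one of them. They are both virtualized environments, but appear to be set up identically.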

MS-MPI MPI_Barrier: sometimes hangs indefinitely, sometimes doesn't
