Running an MPI program with sudo privileges on a cluster

Time: 2019-07-08 23:00:05

Tags: linux sockets raspberry-pi mpi openmpi

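I'm working on a small Raspberry Pi cluster. My host program creates IP packet fragments and sends them to several relay programs. The relays receive these fragments and forward them to the destination using raw sockets. Because of the raw sockets, the relay programs must run with sudo privileges. My setup consists of an RPi 3 B v2 and an RPi 2 B v1. SSH is already set up, and the nodes can SSH into each other without a password, although I have to run ssh-agent and ssh-add my key on each node. I have managed to run a program that sends its rank from one node to another (two different RPis). I run the MPI programs in MPMD fashion: since I only have two RPis, I run the host and one relay on node #1 and another relay on node #2. The host program takes the path of the file to send as a command-line argument.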

If I run:

mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 /home/pi/Desktop/relay

it runs, but the program obviously fails, because without sudo privileges the relays cannot open raw sockets.
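The failure in this unprivileged run can be reproduced outside of MPI. Below is a minimal sketch, assuming Linux and an IPv4 raw socket opened with IPPROTO_RAW (the relay's actual socket call is not shown in the question); try_open_raw_socket and explain_socket_errno are hypothetical helper names.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: tries to open an IPv4 raw socket.
// Returns 0 on success, otherwise the errno value set by socket().
int try_open_raw_socket() {
    const int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
    if (fd < 0) {
        return errno;
    }
    close(fd);
    return 0;
}

// Hypothetical helper: turns that result into a readable message.
std::string explain_socket_errno(int err) {
    if (err == 0) {
        return "raw socket opened";
    }
    if (err == EPERM || err == EACCES) {
        return "permission denied: process is not root and lacks CAP_NET_RAW";
    }
    return std::string("socket() failed: ") + std::strerror(err);
}
```

Printing explain_socket_errno(try_open_raw_socket()) at the start of the relay, before any MPI call, makes it clear whether a failure comes from missing privileges rather than from the MPI setup.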

If I run:

mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo /home/pi/Desktop/relay

the relays report a world size of 1, and the host program hangs.

If I run:

mpirun --oversubscribe -n 1 --host localhost sudo /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo /home/pi/Desktop/relay

all relays and the host report a world size of 1.
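One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is that sudo resets the environment by default. The variables mpirun exports to each launched process would then never reach the program, and MPI_Init would start it as a standalone singleton with a world size of 1. Open MPI's mpirun documents OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK among those variables. Here is a minimal sketch of a check that can run before MPI_Init; env_or and launch_environment_summary are hypothetical helper names.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of an environment variable,
// or the given fallback when the variable is not set.
std::string env_or(const char *name, const std::string &fallback) {
    const char *value = std::getenv(name);
    return value != nullptr ? std::string(value) : fallback;
}

// Hypothetical helper: summarizes what the launcher passed to this process.
std::string launch_environment_summary() {
    return "OMPI_COMM_WORLD_SIZE=" + env_or("OMPI_COMM_WORLD_SIZE", "<unset>") +
           " OMPI_COMM_WORLD_RANK=" + env_or("OMPI_COMM_WORLD_RANK", "<unset>");
}
```

Writing launch_environment_summary() to std::cerr at the top of main, once with and once without sudo in the mpirun command line, shows whether the variables survive the sudo wrapper.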

I found a similar question here: OpenMPI / mpirun or mpiexec with sudo permission

Following the short answer given there, I ran:

mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo -E /home/pi/Desktop/relay

Result:

[raspberrypi:00979] OPAL ERROR: Unreachable in file ext2x_client.c at line 109
[raspberrypi:00980] OPAL ERROR: Unreachable in file ext2x_client.c at line 109
*** An error occurred in MPI_Init
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[raspberrypi:00979] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[raspberrypi:00980] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32582,1],1]
  Exit code:    1
--------------------------------------------------------------------------

I have run sudo visudo, and the file on both nodes looks like this:

# User privilege specification
root    ALL=(ALL:ALL) ALL
pi      ALL = NOPASSWD:SETENV:  /etc/alternatives/mpirun
pi      ALL=NOPASSWD:SETENV:    /usr/bin/orterun
pi      ALL=NOPASSWD:SETENV:    /usr/bin/mpirun

When I run everything on a single node, it works:

sudo mpirun --allow-run-as-root --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,localhost /home/pi/Desktop/relay

//host
#include <filesystem>
#include <iostream>

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int world_size = []() {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        return size;
    }();

    int id = []() {
        int id;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        return id;
    }();

    if (argc != 2) {
        std::cerr << "Filepath not passed\n";
        MPI_Finalize();
        return 0;
    }

    const std::filesystem::path filepath(argv[1]);
    if (not std::filesystem::exists(filepath)) {
        std::cerr << "File doesn't exist\n";
        MPI_Finalize();
        return 0;
    }

    std::cout << "World size: " << world_size << '\n';

    MPI_Finalize();
    return 0;
}

//relay
#include <iostream>

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int world_size = []() {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        return size;
    }();

    int id = []() {
        int id;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        return id;
    }();

    std::cout << "World size: " << world_size << '\n';

    MPI_Finalize();
    return 0;
}

How can I configure the nodes so that they can run MPI programs with sudo?

1 answer:

Answer 0: (score: 0)

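The simplest way to solve this is to set file capabilities on the executable. This still carries a security risk, but a smaller one than making the program setuid root. To grant the capabilities needed to open raw sockets, run the following as root: setcap cap_net_raw,cap_net_admin+eip program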