Question

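I have a program, written by someone else, that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and correctly requesting 72 cores for the job, the job uses only four cores.

I have used scontrol show job <job_id> --details to verify that 72 cores are allocated to the job. I also logged in to the node the job is running on and inspected it with htop. It was running 72 threads, all on four cores. It is worth noting that this is an SMT4 POWER9 CPU, meaning that each physical core executes four threads simultaneously. In the end, it looks like OpenMP is putting all of the threads on a single physical core. The fact that this is an IBM system complicates things further. I can't seem to find any useful documentation on controlling the OpenMP environment more finely. Everything I have found is for Intel.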
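The htop observation can also be checked from Python. What follows is a minimal sketch, assuming Linux and Python 3.3 or later; `thread_affinities` is a name introduced here for illustration. It reports the set of CPUs each thread is allowed to run on (its affinity mask), not the CPU it happens to be running on at that instant.

```python
import os


def thread_affinities(pid):
    """Map each thread ID of process `pid` to the set of CPUs it may run on."""
    affinities = {}
    # On Linux, every thread of a process has a directory under /proc/<pid>/task.
    for name in os.listdir(f"/proc/{pid}/task"):
        tid = int(name)
        try:
            # sched_getaffinity accepts a thread ID as well as a process ID.
            affinities[tid] = os.sched_getaffinity(tid)
        except ProcessLookupError:
            # The thread exited between listing and querying it.
            continue
    return affinities


if __name__ == "__main__":
    # Inspects this interpreter itself; pass the PID of the real job instead.
    for tid, cpus in sorted(thread_affinities(os.getpid()).items()):
        print(tid, sorted(cpus))
```

If every thread reports the same four CPU IDs, the threads have been confined to one SMT4 core by their affinity mask rather than merely scheduled there by chance.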

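I also tried changing the affinity manually with taskset. This worked as expected and moved one of the threads to an unused core. The program kept working as expected afterwards.

In theory, I could write a script that finds all of the threads and then calls taskset to assign them to cores in a sensible way, but I am afraid to do that. It seems like a bad idea to me. It would also take a while.

I guess my main question is: is this a Slurm problem, an OpenMP problem, an IBM problem, or user error? Are there environment variables I need to set that I don't know about? Would it break Slurm if I called taskset manually from a script? If I did that, I would use scontrol to determine which CPUs are allocated to the job. I don't want to anger the people who run the cluster by messing things up.

Here is the submission script. Because of licensing issues, I cannot include any of the code that is actually being run. I am hoping this is just a simple matter of fixing an environment variable. The MPI_OPTIONS variable was recommended by the system administrators. If anyone here happens to have worked with the ENKI cluster, that is where this is running.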

wrk_path=${PWD}

cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path

cd $wrk_path  

module load openmpi/3.1.3/2019
module load pgi/2019

export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat

!

sbatch slurm.sh

Answer 1

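Edit: With 72 cores, the fix gave only a 7x speedup over running on just 4 cores. Given the nature of the computation being run, that is fine.

Edit 2: With 160, the fix gave a 17x speedup over running on just 4 cores.

This may not work for everyone, but I have an actual solution. I wrote a Python script that uses psutil to find all of the child threads of the running process and sets their affinity manually. The script uses scontrol to find out which CPUs are allocated to the job and uses taskset to force the threads to be spread across those CPUs.

So far, the process runs much faster. I'm sure that forcing CPU affinity isn't the best way to do this, but it is much better than not using the available resources at all.

Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. If you are running on a system like mine, you will need to create an Anaconda environment with psutil installed.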

import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
    run_str = 'scontrol show job %s --details'%get_job_id()
    stdout  = subprocess.getoutput(run_str)

    id_spec  = None
    num_cpus = None

    # split() with no argument splits on any whitespace, including newlines.
    for chunk in stdout.split():
        if chunk.lower().startswith("cpu_ids="):
            # The value may be a single range such as 0-71 or a
            # comma-separated list of ids and ranges such as 0-3,8-11.
            id_spec = []
            for part in chunk.split('=', 1)[1].split(','):
                start, _, stop = part.partition('-')
                id_spec.extend(range(int(start), int(stop or start) + 1))
        if chunk.lower().startswith('numcpus='):
            num_cpus = int(chunk.split('=', 1)[1])

    if id_spec is not None and num_cpus is not None:
        return id_spec, num_cpus

    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus 
    # assigned to the job. Once we have that, run the command line supplied.

    cpus, cpu_count = get_proc_info()

    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")

    # If we successfully got to here, run the command line.
    pgmc = subprocess.Popen(sys.argv[1:])
    # Give the program time to start its threads before looking for them.
    time.sleep(10)
    # Replace the placeholder string below with the name of your executable.
    pid  = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid

    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.

    pgmc_proc    = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())

    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.

    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) == 0:
                break
            thread_id      = pgmc_threads.pop().id
            taskset_string = 'taskset -cp %i %i'%(core_id, thread_id)
            print(taskset_string)
            subprocess.getoutput(taskset_string)

    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
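The same round-robin pinning can be done without spawning taskset once per thread. The following is a minimal sketch of that alternative, not the script that produced the speedups reported here. It assumes Linux and Python 3.3 or later, where os.sched_setaffinity accepts a thread ID, and it reads the thread list from /proc instead of psutil. The name `pin_threads_round_robin` is introduced here for illustration.

```python
import os


def pin_threads_round_robin(pid, cpus):
    """Pin each thread of process `pid` to one CPU from `cpus`, cycling
    through the CPUs when there are more threads than CPUs.

    Returns a dict mapping thread ID to the CPU it was pinned to.
    """
    cpus = sorted(cpus)
    if not cpus:
        raise ValueError("no CPUs to pin to")

    # On Linux, every thread of a process has a directory under /proc/<pid>/task.
    tids = sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))

    assignment = {}
    for index, tid in enumerate(tids):
        cpu = cpus[index % len(cpus)]
        try:
            # Restrict this one thread to a single CPU.
            os.sched_setaffinity(tid, {cpu})
        except ProcessLookupError:
            # The thread exited before it could be pinned.
            continue
        assignment[tid] = cpu
    return assignment
```

In the script above, this would be called as pin_threads_round_robin(pid, cpus) in place of the taskset loop. Like the taskset version, it only affects threads that already exist at the time of the call.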

Here is the submission script that was used.

wrk_path=${PWD}

cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path

cd $wrk_path  

module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node


export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
# \$MPI_OPTIONS is escaped so that it is expanded when the job runs, not when
# this here-document is written, where it would still be empty.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="\$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun \$MPI_OPTIONS ~/bin/pgmc-enki > out.dat

!

sbatch slurm.sh

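The main reason I included the submission script is to show how the Python script is used. More specifically, you call it with the actual job as its arguments.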

OpenMP code uses only 4 threads instead of the 72 specified
