I am trying to run a Slurm compute node on a virtual machine managed by Hyper-V. The node runs Ubuntu 16.04.
slurmd -C shows:

NodeName=calc1 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=16013
UpTime=5-20:51:31

This is not entirely correct: the maximum amount of RAM available to this machine is 96 GB, but the RAM is allocated by Hyper-V on demand. Without load, the node has only 16 GB. I tried running some Python scripts that process large data sets outside of Slurm and saw the RAM grow up to the 96 GB maximum.
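The growth under load can be reproduced with a small memory-pressure script. This is a sketch of the kind of script I ran, assuming a Linux guest where MemTotal in /proc/meminfo reflects the memory Hyper-V has currently granted; the chunk count and size are illustrative, not the actual workload:

```python
import os

# Allocate memory in chunks and watch /proc/meminfo to see whether
# Hyper-V Dynamic Memory grows the guest's visible RAM under pressure.

def mem_total_kb(meminfo_text):
    """Parse the MemTotal value (in kB) out of /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("no MemTotal line found")

def apply_pressure(chunks=4, chunk_mb=64):
    """Hold onto growing allocations, printing MemTotal after each one."""
    if not os.path.exists("/proc/meminfo"):
        return  # not a Linux guest; nothing to observe
    held = []
    for i in range(chunks):
        held.append(bytearray(chunk_mb * 1024 * 1024))  # keep a reference so it stays resident
        with open("/proc/meminfo") as f:
            print(f"after {(i + 1) * chunk_mb} MB held:",
                  mem_total_kb(f.read()), "kB total")

if __name__ == "__main__":
    apply_pressure()  # small demo values; scale chunks/chunk_mb up on the node
```

On a host with Dynamic Memory, MemTotal should climb between iterations as the hypervisor balloons RAM in.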
My slurm.conf contains (among other lines):
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
DefMemPerCPU=2048
NodeName=calc1 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=96000 CoreSpecCount=8 MemSpecLimit=6000
However, htop shows only 8 cores loaded while 40 stay idle, and Mem at only 16 GB. Sometimes the node goes into the Drained state because of insufficient memory. It looks like slurmd does not believe me.
How can I make slurmd request the additional gigabytes of RAM?
UPDATE
I still have not applied the slurm.conf changes proposed by @Carles Fenoy, but I have noticed a strange detail. Here is the output of scontrol show node:
NodeName=calc1 Arch=x86_64 CoresPerSocket=48
CPUAlloc=40 CPUErr=0 CPUTot=48 CPULoad=10.25
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=calc1 NodeHostName=calc1 Version=17.11
OS=Linux 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019
RealMemory=96000 AllocMem=81920 FreeMem=179 Sockets=1 Boards=1
CoreSpecCount=8 CPUSpecList=40-47 MemSpecLimit=6000
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=main
BootTime=2019-04-12T12:50:39 SlurmdStartTime=2019-04-18T09:24:29
CfgTRES=cpu=48,mem=96000M,billing=48
AllocTRES=cpu=40,mem=80G
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Then I SSHed into calc1 and issued free -h. Here is its output:
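The scontrol numbers above already show the mismatch: Slurm has allocated about 80 GB to jobs (AllocMem=81920) out of the configured 96 GB, yet the guest reports almost nothing free (FreeMem=179), because Hyper-V has not ballooned that RAM in. A small parser makes the comparison explicit. This is a sketch; it assumes the whitespace-separated key=value layout of scontrol show node seen above, and the "gap" is only a rough indicator since FreeMem also reflects memory jobs are actually using:

```python
# Compare what Slurm has promised to jobs with what the guest
# currently reports free, using "scontrol show node" output.

def parse_scontrol(text):
    """Flatten scontrol's whitespace-separated key=value pairs into a dict."""
    fields = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

def memory_gap_mb(fields):
    """Rough MB gap between memory allocated to jobs and memory free now."""
    return int(fields["AllocMem"]) - int(fields["FreeMem"])

if __name__ == "__main__":
    sample = ("NodeName=calc1 RealMemory=96000 AllocMem=81920 "
              "FreeMem=179 Sockets=1 Boards=1")
    fields = parse_scontrol(sample)
    print("allocated beyond free:", memory_gap_mb(fields), "MB")
```

With the values from the node above the gap is over 80 GB, which is exactly the memory slurmd expects Hyper-V to provide on demand.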
UPDATE 2

I have discussed this issue with our infrastructure people and found out that this mechanism is called Hyper-V Dynamic Memory. I will try to find out whether Microsoft exposes any API for it to the virtual machine. Maybe I will get lucky and someone has already developed a Slurm plugin for it.
Answer (score: 2)
Change the FastSchedule parameter to 0 or 2.
Here is an excerpt from the slurm.conf documentation:
FastSchedule
    Controls how a node's configuration specifications in slurm.conf are used. If the number of node configuration entries in the configuration file is significantly lower than the number of nodes, setting FastSchedule to 1 will permit much faster scheduling decisions to be made. (The scheduler can just check the values in a few configuration records instead of possibly thousands of node records.) Note that on systems with hyper-threading, the processor count reported by the node will be twice the actual processor count. Consider which value you want to be used for scheduling purposes.

    0
        Base scheduling decisions upon the actual configuration of each individual node except that the node's processor count in Slurm's configuration must match the actual hardware configuration if PreemptMode=suspend,gang or SelectType=select/cons_res are configured (both of those plugins maintain resource allocation information using bitmaps for the cores in the system and must remain static, while the node's memory and disk space can be established later).

    1 (default)
        Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will be set to DRAIN.

    2
        Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will not be set DRAIN. This option is generally only useful for testing purposes.
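Applied to the configuration from the question, the change would be a one-line edit; this is a sketch of the resulting slurm.conf fragment, keeping the node definition as posted and only switching FastSchedule so that a node currently reporting less memory than the configured 96000 MB is not drained:

```
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=2
DefMemPerCPU=2048
NodeName=calc1 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=96000 CoreSpecCount=8 MemSpecLimit=6000
```

Per the excerpt above, 2 keeps scheduling against the configured values without draining under-reporting nodes, while 0 would schedule against each node's actual (currently ballooned-down) memory.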