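I have been trying to install Slurm on a single machine to test some issues from my work. I am using Linux Mint 18.3 and Slurm 14.11.8, because the machine I use at work runs that version. But when I run: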
systemctl start slurmctld
I get this error:
slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since mar 2018-03-20 21:19:11 COT; 3s ago
Process: 2862 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 1005 (code=exited, status=1/FAILURE)
mar 20 21:19:11 fabianleon systemd[1]: Starting Slurm controller daemon...
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Control process exited, code=exited status=1
mar 20 21:19:11 fabianleon systemd[1]: Failed to start Slurm controller daemon.
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Unit entered failed state.
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Failed with result 'exit-code'.
This is the slurm.conf I am using:
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=compute-cluster
ControlMachine=fabianleon
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PluginDir=/usr/lib/slurm
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
#
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
#AccountingStorageLoc=
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
#
# COMPUTE NODES
NodeName=fabianleon CPUs=1 RealMemory=1000 State=UNKNOWN
PartitionName=debug Nodes=fabianleon Default=YES MaxTime=INFINITE State=UP
I also tried a different configuration file, but that produces this error instead:
slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: resources) since mar 2018-03-20 21:22:02 COT; 2s ago
Process: 2902 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1005 (code=exited, status=1/FAILURE)
mar 20 21:22:02 fabianleon systemd[1]: Starting Slurm controller daemon...
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: PID 2904 read from file /var/run/slurmctld.pid does not exist or is a zombie.
mar 20 21:22:02 fabianleon systemd[1]: Failed to start Slurm controller daemon.
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: Unit entered failed state.
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: Failed with result 'resources'.
This is the slurm.conf for that second attempt:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=fabianleon
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=fabianleon CPUs=1 RealMemory=1000 State=UNKNOWN
PartitionName=debug Nodes=fabianleon Default=YES MaxTime=INFINITE State=UP
Answer 0 (score: 0)
Does the file /var/run/slurmctld.pid exist? Its owner and permissions should be:
-rw-r--r-- 1 slurm root /var/run/slurmctld.pid
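One way to run that check is sketched below. It assumes GNU stat (as shipped with Linux Mint); check_pidfile is a hypothetical helper written for this answer, not something provided by Slurm. It reports whether the file exists and whether its owner and octal mode match what you expect (slurm and 644, matching the listing above).

```shell
#!/bin/sh
# check_pidfile is a hypothetical helper, not part of Slurm.
# Usage: check_pidfile FILE EXPECTED_OWNER EXPECTED_OCTAL_MODE
# Returns 0 only if FILE exists and its owner and mode both match.
check_pidfile() {
    pidfile="$1"
    want_owner="$2"
    want_mode="$3"

    if [ ! -e "$pidfile" ]; then
        echo "missing: $pidfile"
        return 1
    fi

    # GNU stat format sequences: %U = owner name, %a = octal permission bits
    owner=$(stat -c '%U' "$pidfile")
    mode=$(stat -c '%a' "$pidfile")
    echo "$pidfile owner=$owner mode=$mode"

    [ "$owner" = "$want_owner" ] && [ "$mode" = "$want_mode" ]
}

# Path taken from SlurmctldPidFile in the slurm.conf above;
# owner "slurm" matches SlurmUser, and 644 corresponds to -rw-r--r--.
if ! check_pidfile /var/run/slurmctld.pid slurm 644; then
    echo "PID file check failed"
fi
```

If the check reports the file as missing or owned by another user, that would be consistent with the "PID 2904 read from file /var/run/slurmctld.pid does not exist or is a zombie" message in the second status output, although it does not by itself prove that this is the cause.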