Error starting the slurm service

Time: 2018-03-21 02:43:00

Tags: slurm

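I have been trying to install slurm on a single machine so I can reproduce some problems from my work. I am using Linux Mint 18.3 and slurm 14.11.8, because the machine I will be working on runs that version, but when I run: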

systemctl start slurmctld

I get this error:

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since mar 2018-03-20 21:19:11 COT; 3s ago
  Process: 2862 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1005 (code=exited, status=1/FAILURE)

mar 20 21:19:11 fabianleon systemd[1]: Starting Slurm controller daemon...
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Control process exited, code=exited status=1
mar 20 21:19:11 fabianleon systemd[1]: Failed to start Slurm controller daemon.
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Unit entered failed state.
mar 20 21:19:11 fabianleon systemd[1]: slurmctld.service: Failed with result 'exit-code'.

This is the slurm.conf I am using:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=compute-cluster
ControlMachine=fabianleon
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PluginDir=/usr/lib/slurm
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
#
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
#AccountingStorageLoc=
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
#
# COMPUTE NODES

NodeName=fabianleon CPUs=1 RealMemory=1000 State=UNKNOWN 
PartitionName=debug Nodes=fabianleon Default=YES MaxTime=INFINITE State=UP

I also tried a different configuration file, but that produces this error:

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: resources) since mar 2018-03-20 21:22:02 COT; 2s ago
  Process: 2902 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1005 (code=exited, status=1/FAILURE)

mar 20 21:22:02 fabianleon systemd[1]: Starting Slurm controller daemon...
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: PID 2904 read from file /var/run/slurmctld.pid does not exist or is a zombie.
mar 20 21:22:02 fabianleon systemd[1]: Failed to start Slurm controller daemon.
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: Unit entered failed state.
mar 20 21:22:02 fabianleon systemd[1]: slurmctld.service: Failed with result 'resources'.

with this slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=fabianleon
#ControlAddr=
# 
#MailProg=/bin/mail 
MpiDefault=none
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818 
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root 
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321 
SelectType=select/linear
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3 
#SlurmctldLogFile=
#SlurmdDebug=3 
#SlurmdLogFile=
# 
# 
# COMPUTE NODES 
NodeName=fabianleon CPUs=1 RealMemory=1000 State=UNKNOWN 
PartitionName=debug Nodes=fabianleon Default=YES MaxTime=INFINITE State=UP

1 answer:

Answer 0: (score: 0)

Does the file /var/run/slurmctld.pid exist? Its permissions should be:

-rw-r--r-- 1 slurm root /var/run/slurmctld.pid
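
The check above can be scripted. What follows is a minimal sketch, not a definitive diagnosis: it reads the paths that slurm.conf assigns to SlurmctldPidFile, StateSaveLocation and SlurmctldLogFile, then reports whether each exists and who owns it. The helper names conf_value and describe_path are hypothetical (they are not part of Slurm), the default path /etc/slurm/slurm.conf is an assumption about the install, and stat -c assumes GNU coreutils as shipped with Linux Mint.

```shell
#!/bin/sh
# Sketch only: conf_value and describe_path are hypothetical helpers, not Slurm tools.

# Print the value assigned to a key in a slurm.conf-style file.
# Commented-out lines are ignored; if the key appears twice, the last one is printed.
# Note: matching here is case-sensitive, which is stricter than slurm.conf itself.
conf_value() {
    # $1 = key name, $2 = path to the config file
    sed -n "s/^[[:space:]]*$1=\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}

# Report whether a path exists and, if so, its owner and octal mode (GNU stat).
describe_path() {
    if [ -e "$1" ]; then
        stat -c '%n owner=%U mode=%a' "$1"
    else
        echo "$1 missing"
    fi
}

# Assumed config location; override with SLURM_CONF if yours differs.
CONF="${SLURM_CONF:-/etc/slurm/slurm.conf}"
if [ -r "$CONF" ]; then
    for key in SlurmctldPidFile StateSaveLocation SlurmctldLogFile; do
        value=$(conf_value "$key" "$CONF")
        if [ -n "$value" ]; then
            describe_path "$value"
        fi
    done
fi
```

The owner reported for each path can then be compared with the SlurmUser set in the same file (slurm in both configurations in the question), since slurmctld runs as that user and has to be able to write its state directory. If the ownership looks right and the unit still fails, running the daemon in the foreground with slurmctld -D -vvv usually prints the actual reason it exits, which the systemd status output above does not show.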