Unable to start the Slurm database daemon (slurmdbd)

Time: 2017-11-20 09:46:25

Tags: centos7 slurm accounting

Slurm 17.02.3 was installed on my cluster some time ago, but I recently decided to use SlurmDBD for accounting.

After installing several packages on CentOS 7 (slurm-devel, slurm-munge, slurm-perlapi, slurm-plugins, slurm-slurmdbd and slurm-sql) along with MariaDB, I created an SQL database:

mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost'
-> identified by 'some_pass' with grant option;
mysql> create database slurm_acct_db;

and configured the slurmdbd.conf file:

AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=some_pass
StorageUser=slurm
StorageLoc=slurm_acct_db

Then I stopped the slurmctld daemon on the head node of my cluster and tried to start slurmdbd, but got the following:

$ systemctl start slurmdbd
Job for slurmdbd.service failed because the control process exited with error code. See "systemctl status slurmdbd.service" and "journalctl -xe" for details.
$ systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since lun 2017-11-20 10:39:26 CET; 53s ago
  Process: 27592 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE)

nov 20 10:39:26 nodo00 systemd[1]: Starting Slurm DBD accounting daemon...
nov 20 10:39:26 nodo00 systemd[1]: slurmdbd.service: control process exited, code=exited status=1
nov 20 10:39:26 nodo00 systemd[1]: Failed to start Slurm DBD accounting daemon.
nov 20 10:39:26 nodo00 systemd[1]: Unit slurmdbd.service entered failed state.
nov 20 10:39:26 nodo00 systemd[1]: slurmdbd.service failed.
$ journalctl -xe
nov 20 10:39:26 nodo00 polkitd[1078]: Registered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /or
nov 20 10:39:26 nodo00 systemd[1]: Starting Slurm DBD accounting daemon...
-- Subject: Unit slurmdbd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit slurmdbd.service has begun starting up.
nov 20 10:39:26 nodo00 systemd[1]: slurmdbd.service: control process exited, code=exited status=1
nov 20 10:39:26 nodo00 systemd[1]: Failed to start Slurm DBD accounting daemon.
-- Subject: Unit slurmdbd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit slurmdbd.service has failed.
-- 
-- The result is failed.
nov 20 10:39:26 nodo00 systemd[1]: Unit slurmdbd.service entered failed state.
nov 20 10:39:26 nodo00 systemd[1]: slurmdbd.service failed.
nov 20 10:39:26 nodo00 polkitd[1078]: Unregistered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871, object path /org/freedesktop/PolicyKit1/AuthenticationAgent,
nov 20 10:40:06 nodo00 gmetad[1519]: data_thread() for [HPCSIE] failed to contact node 192.168.2.10
nov 20 10:40:06 nodo00 gmetad[1519]: data_thread() got no answer from any [HPCSIE] datasource
nov 20 10:40:13 nodo00 dhcpd[2320]: DHCPREQUEST for 192.168.2.19 from 00:1e:67:48:79:d0 via enp6s0f1
nov 20 10:40:13 nodo00 dhcpd[2320]: DHCPACK on 192.168.2.19 to 00:1e:67:48:79:d0 via enp6s0f1
nov 20 10:40:39 nodo00 dhcpd[2320]: DHCPREQUEST for 192.168.2.13 from 00:1e:67:48:74:e0 via enp6s0f1
nov 20 10:40:39 nodo00 dhcpd[2320]: DHCPACK on 192.168.2.13 to 00:1e:67:48:74:e0 via enp6s0f1

I just found out that the file /var/run/slurmdbd.pid does not even exist.

I would appreciate any hint on this issue.

Thanks.

1 answer:

Answer 0 (score: 0)

Just create the pid file:

  

touch /var/run/slurmdbd.pid

and give your slurm user access to it:

  

chown slurm: /var/run/slurmdbd.pid
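
The two steps can be wrapped in a small helper so they are easy to repeat. This is only a sketch: `ensure_pidfile` is a hypothetical function name, not part of Slurm, and the path and user are assumed to match the PidFile and SlurmUser values from the slurmdbd.conf in the question. It also assumes GNU chown (as on CentOS 7), where a trailing colon after the user name selects that user's login group. Since /var/run normally lives on a tmpfs on CentOS 7, the file will likely have to be recreated after a reboot.

```shell
#!/bin/sh
# Sketch: create the PID file and hand it to the Slurm user.
# ensure_pidfile is a hypothetical helper, not something Slurm ships.
ensure_pidfile() {
    pidfile=$1   # e.g. /var/run/slurmdbd.pid (PidFile in slurmdbd.conf)
    owner=$2     # e.g. slurm (SlurmUser in slurmdbd.conf)

    # Create the file if it is missing; touch leaves existing content alone.
    touch "$pidfile" || return 1

    # Assumes GNU chown: "user:" sets the owner and that user's login group.
    chown "$owner:" "$pidfile" || return 1
}

# Usage on the head node, as root:
#   ensure_pidfile /var/run/slurmdbd.pid slurm
#   systemctl start slurmdbd
```

If slurmdbd still fails to start afterwards, the LogFile named in slurmdbd.conf (/var/log/slurm/slurmdbd.log in the question) is the next place to look.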