I have been trying to set up the Torque scheduler for a small cluster. I followed the steps at http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php
to set up the scheduler. However, when I try
qterm -t quick
I get the following error:
$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused
The server itself starts fine, though. However, when I try to run a command that should run on multiple nodes, for example
qsub -l nodes=2:ppn=4 /home/user/scripts/someScript
it prints something like
7.Terra
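where Terra is the name of the head node, which is also a compute node in the cluster. That is not the problem. The problem is that the job never runs, and there is no output anywhere either :/
Torque server log: https://ptpb.pw/EaKo
Terra node log: https://ptpb.pw/9w5M
Marte node log: https://ptpb.pw/o4PT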
I can get it to run with a PBS script, but only on a single node...
#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv
This is the qstat output after submitting the job:
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra                  testPerformance  justin                 0 R batch
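For watching whether a submitted job ever leaves the queue, the state column of the listing above (R here) is the part that matters. A minimal sketch for pulling it out, assuming the default qstat column layout shown above; the function name job_states is made up for illustration:

```shell
# Print "<job id> <state>" for every job row of default `qstat` output on stdin.
# Assumption: job rows start with a numeric job id and the state is the
# second-to-last whitespace-separated field, as in the listing above.
job_states() {
  awk '/^[0-9]/ && NF >= 5 { print $1, $(NF-1) }'
}
```

It would be used as `qstat | job_states`; a job that stays in state Q was accepted by the server but never dispatched to a node.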
Output of pbsnodes -a:
Terra
     state = free
     power_state = Running
     np = 4
     properties = Tower
     ntype = cluster
     status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

Marte
     state = free
     power_state = Running
     np = 4
     properties = NFSServer
     ntype = cluster
     status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003
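Since the full listing is long, a condensed view can make it easier to confirm that both nodes report as free with the expected slot count. A rough sketch, assuming the "name, then key = value lines" layout of the output above; node_summary is a hypothetical helper name, not part of Torque:

```shell
# Condense `pbsnodes -a` output (stdin) to one "<node> <state> np=<n>" line per node.
# Assumption: a node record starts with a line holding only the node name, and
# "state = ..." appears before "np = ..." within a record, as in the listing above.
node_summary() {
  awk '
    NF == 1 && $0 !~ /=/         { name = $1; next }   # bare word: start of a node record
    $1 == "state" && $2 == "="   { state = $3 }
    $1 == "np"    && $2 == "="   { print name, state, "np=" $3 }
  '
}
```

It would be used as `pbsnodes -a | node_summary`.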
and /var/spool/torque/server_priv/nodes:
Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
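As a sanity check that the nodes file can satisfy a request such as nodes=2:ppn=4 at all, the np values can be compared against the request. This is only a sketch of a static check: it ignores slots that are already busy, can_satisfy is a made-up name, and the default of np=1 for entries without an np value is my assumption about how Torque reads the file.

```shell
# can_satisfy NODES PPN < nodes_file
# Succeeds (exit 0) if at least NODES entries in the nodes file have np >= PPN.
can_satisfy() {
  local want_nodes=$1 want_ppn=$2
  awk -v n="$want_nodes" -v p="$want_ppn" '
    /^[ \t]*(#|$)/ { next }                  # skip comments and blank lines
    {
      np = 1                                 # assumed default when np= is absent
      for (i = 2; i <= NF; i++)
        if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
      if (np >= p + 0) ok++
    }
    END { exit !(ok + 0 >= n + 0) }
  '
}
```

For example, `can_satisfy 2 4 < /var/spool/torque/server_priv/nodes` succeeds for the file above, so the configuration itself does not rule out the two-node request.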
Edit: here are the latest logs as well.
MOM log for the compute node: https://ptpb.pw/DhKi
MOM log for the head node: https://ptpb.pw/MTlD
Server log: https://ptpb.pw/HPkE