Torque cannot communicate with the host

Time: 2018-08-18 01:39:18

Tags: linux cluster-computing torque

I have been trying to set up the Torque scheduler for a small cluster. I set up the scheduler by following the steps at http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php

However, when I try

qterm -t quick

I get the following error:

$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused 

The server itself starts up fine, though. However, when I try to submit a command that should run on multiple nodes, for example

qsub -l nodes=2:ppn=4 /home/user/scripts/someScript

it prints something like

7.Terra

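where Terra is the name of the head node, which is also a compute node in the cluster. That part is not the problem. The problem is that the job never runs, and there is no output anywhere either :/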

Torque server log: https://ptpb.pw/EaKo

Terra node log: https://ptpb.pw/9w5M

and the Marte log: https://ptpb.pw/o4PT

I can get a job to run with a PBS script, but only on a single node...

#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv

Here is the qstat output after submitting the job:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra                   testPerformance  justin                 0 R batch      

Output of pbsnodes -a:

Terra
 state = free
 power_state = Running
 np = 4
 properties = Tower
 ntype = cluster
 status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003
 gpus = 1

Marte
 state = free
 power_state = Running
 np = 4
 properties = NFSServer
 ntype = cluster
 status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003

and the contents of /var/spool/torque/server_priv/nodes:

Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
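
As a sanity check on the nodes file above, the number of nodes and the total np slots can be tallied to confirm that a request such as nodes=2:ppn=4 fits on paper. This is only a minimal sketch: the function name summarize_nodes is hypothetical, and it assumes every non-comment line has the form "<hostname> np=<N> [properties...]", treating a missing np= as 1.

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque): summarize a nodes file.
# Assumes each non-empty, non-comment line is "<hostname> np=<N> [properties...]".
summarize_nodes() {
    awk '
        /^[[:space:]]*(#|$)/ { next }      # skip comment and blank lines
        {
            np = 1                          # assumption: np counts as 1 when not given
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
            printf "%s np=%d\n", $1, np     # one line per node
            nodes++; slots += np
        }
        END { printf "nodes=%d slots=%d\n", nodes, slots }
    ' "$1"
}

# Usage (path taken from the question):
#   summarize_nodes /var/spool/torque/server_priv/nodes
```

For the two-line file shown above, the last line printed is "nodes=2 slots=8", so a nodes=2:ppn=4 request matches the configured capacity. That only rules out a capacity mismatch in the nodes file; it does not explain why the job is never started.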

Edit: here are the latest logs as well.

MOM log of the compute node: https://ptpb.pw/DhKi

MOM log of the head node: https://ptpb.pw/MTlD

and the server log: https://ptpb.pw/HPkE

0 answers