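I installed the PBS server successfully, started the services, and can list the nodes with the pbsnodes command. The queues show up correctly in the output of qstat -q. After I submit a test job, the following appears in my sched_log, my server_log, and the mom_log on the mom node: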
sched_log:
08/16/2017 14:18:48.476;64; pbs_sched.19885;Job;2.headnode;Job Run
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode3;Can not open connection to mom
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode4;Can not open connection to mom
08/16/2017 14:19:28.238;02; pbs_sched.19885;Req;headnode5;Can not open connection to mom
08/16/2017 14:19:28.239;02; pbs_sched.19885;Req;headnode6;Can not open connection to mom
server_log:
08/16/2017 14:40:37.829;01;PBS_Server.27737;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.233:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27739;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.232:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27793;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.235:15003]
08/16/2017 14:40:38.828;01;PBS_Server.27736;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.234:15003]
mom_log:
08/16/2017 18:50:36.215;01; pbs_mom.10833;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11123 MOM status update intervals
08/16/2017 18:51:22.308;01; pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
08/16/2017 18:51:22.308;01; pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11124 MOM status update intervals
08/16/2017 18:52:06.402;01; pbs_mom.10859;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status update successfully sent after 11124 MOM status update intervals
08/16/2017 18:53:21.555;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 18:58:26.182;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:03:31.815;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:08:31.407;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:13:37.039;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:18:41.670;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:23:46.455;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
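How can I fix this? Is it caused by some kind of authentication failure? If so, should I set up SSH key-based login?
Interestingly, I have another server named headnode2, with IP .89.231, that shows no errors at all. I took no extra steps to configure that one.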
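Answer 0 (score: 1)
You probably just need to configure your firewall. I ran
deleteSpreadsheets().then....
on the server and on one test node, then submitted a job to that node to see whether it would run.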
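Before changing any firewall rules, it can help to confirm from the server that the port named in the server_log errors is really unreachable. The following is a minimal sketch, assuming the node addresses and port 15003 shown in the question's server_log; `port_open` and `check_nodes` are hypothetical helper names, not part of Torque.

```python
import socket

# Node addresses and port copied from the server_log excerpt in the question.
MOM_ADDRESSES = [
    "192.168.89.232",
    "192.168.89.233",
    "192.168.89.234",
    "192.168.89.235",
]
MOM_PORT = 15003


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_nodes(hosts, port, timeout=3.0):
    """Map each host to True (reachable) or False (not reachable)."""
    return {host: port_open(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    for host, ok in check_nodes(MOM_ADDRESSES, MOM_PORT).items():
        print(f"{host}:{MOM_PORT} {'reachable' if ok else 'NOT reachable'}")
```

Run it on the server, then again from a node that works (headnode2 in the question) for comparison. A failed connection alone does not tell a firewall block apart from pbs_mom simply not running, so also check that the daemon is up on the node before concluding the firewall is at fault.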