My system is CentOS 7, and I installed torque-6.1.0.
I configured it with: ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings
My server is named "node00", and I added a compute node named "node01":
[root@node00 torque]# pbsnodes
node01
state = free
power_state = Running
np = 16
ntype = cluster
status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
mom_service_port = 15002
mom_manager_port = 15003
I submitted a simple job: echo "sleep 5" | qsub
Then qstat -f showed:
queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
log for exact details.
comment = Job started on Tue Dec 06 at 21:35
So I read the mother superior's log: vi /var/spool/torque/mom_logs/20161206
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02; pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128; pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02; pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals
It seems that node01 and node00 cannot send data to each other. Is that right? How can I fix this?
Answer 0 (score: 0)
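Regarding the title text, "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0": this is an error that shows up on systems in the following situations:
In all cases, tcpdump on the pbs_server side shows that the MOM is sent a TCP reset right after it sends a status update. This is easy to trace with: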
tcpdump -i <interface> tcp port 15001 and tcp[13]=4
08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0
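For readers unfamiliar with the filter above: byte 13 of the TCP header holds the flag bits, and the value 4 is the RST bit, so `tcp[13]=4` matches packets whose only flag is RST (shown as `Flags [R]` in the capture). The following is a minimal Python sketch of that decoding, added purely as an illustration. The helper names `decode_tcp_flags` and `matches_filter` are hypothetical and are not part of tcpdump or Torque.

```python
# Flag bits in byte 13 of the TCP header (RFC 793 layout).
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}


def decode_tcp_flags(flag_byte):
    """Return the names of all flags set in the given TCP flags byte."""
    return [name for bit, name in TCP_FLAGS.items() if flag_byte & bit]


def matches_filter(flag_byte):
    """Mimic the capture filter `tcp[13]=4`: true only for a bare RST."""
    return flag_byte == 0x04


if __name__ == "__main__":
    for value in (0x04, 0x14, 0x12):
        print(hex(value), decode_tcp_flags(value), matches_filter(value))
```

Because the filter is an exact comparison, a reset that also carries ACK (flags byte 0x14) would not match. If those should be captured as well, the bitmask form `tcp[13] & 4 != 0` can be used instead.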
On the node this is logged:
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
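Update: We finally solved the problem by implementing a MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy. For a 500-node cluster we defined 8 groups (paths in mom_hierarchy), each with a top level of 2 nodes and one further level holding the rest of the nodes in that group. Like this: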
<path>
<level>node1,node2</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
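Writing eight paths with some 60 node names each by hand is error-prone, so a file with this layout could be generated instead. Below is a minimal Python sketch under stated assumptions: the function name `render_mom_hierarchy` and the compute node names (`c001` and so on) are hypothetical, and the output only mirrors the `<path>`/`<level>` layout shown above. It has not been checked against a real pbs_server.

```python
def render_mom_hierarchy(paths):
    """Render paths as <path>/<level> text.

    `paths` is a list of paths; each path is a list of levels,
    and each level is a list of node names.
    """
    lines = []
    for levels in paths:
        lines.append("<path>")
        for nodes in levels:
            lines.append("<level>" + ",".join(nodes) + "</level>")
        lines.append("</path>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    compute = ["c%03d" % i for i in range(1, 7)]
    paths = [
        [["node1", "node2"], compute[:3]],  # first group
        [["node2", "node1"], compute[3:]],  # second group, top level swapped
    ]
    print(render_mom_hierarchy(paths), end="")
```

The output would then be written to the mom_hierarchy file under server_priv mentioned above. How the nodes are split into groups is a site decision and is left to the caller here.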