Error message when running a job with Torque: read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0

Time: 2016-12-06 12:49:39

Tags: pbs torque

My system is CentOS 7, and I installed torque-6.1.0, configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings

My server is named "node00", and I added a compute node named "node01".

[root@node00 torque]# pbsnodes
node01
     state = free
     power_state = Running
     np = 16
     ntype = cluster
     status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003

I submitted a simple job with echo "sleep 5" | qsub, and qstat -f then showed an error message:

queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
    log for exact details.
comment = Job started on Tue Dec 06 at 21:35

So I read the mother superior's log with vi /var/spool/torque/mom_logs/20161206:

12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02;   pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128;   pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02;   pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals

It seems that node01 and node00 cannot send data to each other. Is that right? How can I fix this?

1 answer:

Answer 0 (score: 0)

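Regarding the title text, "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0": this is an error I have seen on systems in the following situations: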

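  1. pbs_mom is running on a node that pbs_server does not know about (the node is missing from the nodes file).
  2. The /var/spool/torque/server_priv/jobs directory is clogged with job files that should have been deleted when their jobs ended (pbs_server cleans up very poorly, so this can easily grow to thousands of files). The same applies to the /var/spool/torque/server_priv/arrays directory.
  3. Even with both of the above ruled out, it can still be seen on a system with 400 nodes and 1000 jobs (queued and/or running). In that case it occurs 5-10 times per hour.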
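
The first two situations can be checked from the pbs_server host. What follows is only a minimal sketch: the helper names node_is_known and count_files are made up for illustration, and the default TORQUE_HOME is assumed from the paths quoted above, so adjust it if your install differs.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Torque.
# Default taken from the paths quoted above; override for other installs.
TORQUE_HOME="${TORQUE_HOME:-/var/spool/torque}"

# Succeed if the node name is the first field of some line in the nodes file.
node_is_known() {
    # $1 = node name, $2 = path to the nodes file
    awk -v n="$1" '$1 == n { found = 1 } END { exit !found }' "$2"
}

# Print the number of regular files directly under a directory (0 if absent).
count_files() {
    if [ -d "$1" ]; then
        find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Example usage (as root on the server):
#   node_is_known node01 "$TORQUE_HOME/server_priv/nodes" || echo "node01 not in nodes file"
#   count_files "$TORQUE_HOME/server_priv/jobs"
#   count_files "$TORQUE_HOME/server_priv/arrays"
```

A count in the thousands while only a handful of jobs are queued would point at the second situation.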

In all of these cases, tcpdump on the pbs_server side shows that the MOM is sent a TCP reset after it sends its status update. This is easy to trace with:

    tcpdump -i <interface> tcp port 15001 and tcp[13]=4

    08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0
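
A note on the filter used above: tcp[13] is the TCP flags byte and RST has the value 4, so tcp[13]=4 matches only segments where RST is the sole flag set. If you also want resets that carry other flags (RST+ACK, for example), the symbolic pcap form can be used instead. A small sketch, where rst_filter is a hypothetical helper name and eth0 is a placeholder interface:

```shell
#!/bin/sh
# Sketch only: rst_filter is a made-up helper that prints a pcap filter string.
# It matches any TCP segment on the given port that has the RST bit set,
# whether or not other flags are set as well.
rst_filter() {
    # $1 = TCP port (15001 is the port used in the command above)
    printf 'tcp port %s and (tcp[tcpflags] & tcp-rst != 0)' "$1"
}

# Example usage (needs root; "eth0" is a placeholder interface name):
#   tcpdump -n -i eth0 "$(rst_filter 15001)"
```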

On the node this is logged:
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals

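Update: In the end we resolved the problem by implementing a MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy. For a 500-node cluster we defined 8 groups (paths in mom_hierarchy), each with 2 nodes at the top level and one further level holding the remaining nodes of that group. Like this: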

<path>
<level>node1,node2</level>
<level> comma separated list of some 60 nodes</level>
</path>
<path> 
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
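
Typing 8 such paths by hand is error-prone, so the entries can be generated instead. This is only a sketch of how I would do it: emit_path is a made-up helper name, the node names are placeholders, and the group files (one node name per line) are an assumption about how you keep your node lists.

```shell
#!/bin/sh
# Sketch only: emit_path is a hypothetical helper, not a Torque tool.
# Prints one <path> entry in the same two-level layout as the excerpt above.
emit_path() {
    # $1 = comma-separated top-level nodes
    # $2 = comma-separated list of the remaining nodes in the group
    printf '<path>\n<level>%s</level>\n<level>%s</level>\n</path>\n' "$1" "$2"
}

# Example usage, assuming group1.txt and group2.txt each hold one node
# name per line (paste joins the lines with commas):
#   emit_path node1,node2 "$(paste -sd, group1.txt)"
#   emit_path node2,node1 "$(paste -sd, group2.txt)"
```

Redirect the combined output to a scratch file and review it before copying it over server_priv/mom_hierarchy.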