Torque PBS jobs going into the debug queue

Time: 2016-04-08 23:04:41

Tags: hpc pbs torque

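At my new job I manage a cluster that uses Torque as the resource manager and Maui as the scheduler.

I am currently facing a recurring problem: one particular user's jobs are always sent to the debug queue. Here is the list of active queues on the system: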

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
debug              --      --    00:20:00   --    0   0 12   E R
intel              --      --       --      --    0   0 --   E R
medium             --      --    72:00:00   --    0   0 12   E R
bighuge            --      --       --      --    0   0 --   E R
long               --      --       --      --    0   0 12   E R
                                               ----- -----
                                                   0     0

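The walltime of the jobs this user submits is several hours, so I am puzzled as to why they are being sent to the debug queue.

Here is the tracejob output as well: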

04/08/2016 15:46:48  S    enqueuing into intel, state 1 hop 1
04/08/2016 15:46:48  S    dequeuing from intel, state QUEUED
04/08/2016 15:46:48  S    enqueuing into debug, state 1 hop 1
04/08/2016 15:46:48  S    Job Queued at request of dawn@cm01, owner = dawn@cm01, job name = run01_submit.script, queue =
                          debug
04/08/2016 15:46:49  S    Job Run at request of root@cm01
04/08/2016 15:46:49  S    child reported success for job after 0 seconds (dest=n20), rc=0
04/08/2016 15:46:49  S    preparing to send 'b' mail for job 15631.cm01 to dawn@cm01 (---)
04/08/2016 15:46:49  S    Not sending email: User does not want mail of this type.
04/08/2016 15:46:49  S    obit received - updating final job usage info
04/08/2016 15:46:49  S    job exit status 1 handled
04/08/2016 15:46:49  S    preparing to send 'e' mail for job 15631.cm01 to dawn@cm01 (Exit_status=1
04/08/2016 15:46:49  S    Not sending email: User does not want mail of this type.
04/08/2016 15:46:49  S    Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:00
04/08/2016 15:46:49  S    on_job_exit task assigned to job
04/08/2016 15:46:49  S    req_jobobit completed
04/08/2016 15:46:49  S    JOB_SUBSTATE_EXITING
04/08/2016 15:46:49  S    JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49  S    about to copy stdout/stderr/stageout files
04/08/2016 15:46:49  S    JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49  S    JOB_SUBSTATE_STAGEDEL
04/08/2016 15:46:49  S    JOB_SUBSTATE_EXITED
04/08/2016 15:46:49  S    JOB_SUBSTATE_COMPLETE
04/08/2016 15:50:54  S    Request invalid for state of job COMPLETE
04/08/2016 15:51:00  S    Request invalid for state of job COMPLETE
04/08/2016 15:51:49  S    dequeuing from debug, state COMPLETE

For now, the workaround is to change the job's assigned queue manually with the qalter command.

Any ideas?

1 answer:

Answer 0 (score: 0)

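Because the job hops from the intel queue to debug immediately, I suspect you have automatic routing configured in qmgr or in Maui. If the intel queue is configured as a routing queue, that would explain it.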

Run qmgr -c "print queue intel" to check.
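As a rough illustration of what to look for: in Torque, a routing queue has queue_type set to Route, and its route_destinations attribute lists the queues it forwards jobs to. The helper below is a hypothetical convenience (the name route_settings is not part of Torque) that filters those two settings out of the qmgr output read on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `qmgr -c "print queue NAME"`
# on stdin and keep only the lines that describe routing behaviour.
route_settings() {
    grep -E 'queue_type|route_destinations'
}

# Intended use on the pbs_server host (assumes qmgr is on PATH):
#   qmgr -c "print queue intel" | route_settings
```

If the output contains a line such as "set queue intel queue_type = Route" together with a route_destinations entry naming debug, that would account for the hop seen in tracejob.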

If it is not a routing queue, you can raise the loglevel to get a better view of what is happening in the pbs_server logs.
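A minimal sketch of raising the server log verbosity, assuming the server attribute is named log_level (with 7 as the most verbose setting) and that the commands are run by a Torque manager on the pbs_server host. Check the attribute name against your Torque version before relying on it.

```shell
#!/bin/sh
# Assumption: qmgr is on PATH and the caller has manager rights.
if command -v qmgr >/dev/null 2>&1; then
    # Show the current value (prints nothing if the attribute is unset).
    qmgr -c "list server log_level"
    # Raise verbosity to the assumed maximum of 7.
    qmgr -c "set server log_level = 7"
else
    echo "qmgr not found; run this on the pbs_server host" >&2
fi
```

Remember to lower the value again after debugging, since verbose logging grows the server logs quickly.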

When I set up a routing queue this way, I get the same kind of tracejob output on job submission:

05/20/2016 20:04:05.439 S    enqueuing into route, state 1 hop 1
05/20/2016 20:04:05.440 S    dequeuing from route, state QUEUED
05/20/2016 20:04:05.440 S    enqueuing into test, state 1 hop 1
05/20/2016 20:04:05.737 S    Job Run at request of root@testserver
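To spot this pattern quickly across several jobs, the "enqueuing into" lines can be pulled out of the tracejob output. The function below is only a sketch: queue_hops is a hypothetical name, and it assumes the log lines have the "enqueuing into QUEUE, state ..." shape shown above.

```shell
#!/bin/sh
# Hypothetical helper: read tracejob output on stdin and print the
# queues the job was enqueued into, in order, joined by " -> ".
queue_hops() {
    awk '/enqueuing into/ {
        for (i = 1; i < NF; i++) {
            if ($i == "into") {
                q = $(i + 1)
                sub(/,$/, "", q)          # drop the trailing comma
                printf "%s%s", sep, q
                sep = " -> "
            }
        }
    }
    END { if (sep != "") print "" }'
}

# Intended use (JOBID is a placeholder for a real job id):
#   tracejob JOBID | queue_hops
```

More than one queue name in the result means the job was moved by the server after submission rather than submitted to its final queue directly.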

Otherwise, check the Maui configuration and logs for clues.