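In my new job, I manage a cluster that uses Torque as the resource manager and Maui as the scheduler.
I keep running into the same problem: one particular user's jobs are always sent to the debug queue. Here is the list of active queues on the system: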
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
debug              --      --    00:20:00   --    0   0 12   E R
intel              --      --       --      --    0   0 --   E R
medium             --      --    72:00:00   --    0   0 12   E R
bighuge            --      --       --      --    0   0 --   E R
long               --      --       --      --    0   0 12   E R
                                               ----- -----
                                                   0     0
The jobs this user submits request a walltime of several hours, so I am puzzled as to why they end up in the debug queue.
Here is the tracejob output as well:
04/08/2016 15:46:48 S enqueuing into intel, state 1 hop 1
04/08/2016 15:46:48 S dequeuing from intel, state QUEUED
04/08/2016 15:46:48 S enqueuing into debug, state 1 hop 1
04/08/2016 15:46:48 S Job Queued at request of dawn@cm01, owner = dawn@cm01, job name = run01_submit.script, queue = debug
04/08/2016 15:46:49 S Job Run at request of root@cm01
04/08/2016 15:46:49 S child reported success for job after 0 seconds (dest=n20), rc=0
04/08/2016 15:46:49 S preparing to send 'b' mail for job 15631.cm01 to dawn@cm01 (---)
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S obit received - updating final job usage info
04/08/2016 15:46:49 S job exit status 1 handled
04/08/2016 15:46:49 S preparing to send 'e' mail for job 15631.cm01 to dawn@cm01 (Exit_status=1
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
04/08/2016 15:46:49 S on_job_exit task assigned to job
04/08/2016 15:46:49 S req_jobobit completed
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITING
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S about to copy stdout/stderr/stageout files
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEDEL
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITED
04/08/2016 15:46:49 S JOB_SUBSTATE_COMPLETE
04/08/2016 15:50:54 S Request invalid for state of job COMPLETE
04/08/2016 15:51:00 S Request invalid for state of job COMPLETE
04/08/2016 15:51:49 S dequeuing from debug, state COMPLETE
For now, my workaround is to use the qalter command to manually change the queue assigned to the job.
Any ideas?
Answer 0 (score: 0)
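Because the job hops from the intel queue to debug immediately, I suspect automatic routing is configured in qmgr or in Maui. If the intel queue is set up as a routing queue, that would explain it.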
Run qmgr -c "print queue intel" to check.
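For comparison, a routing queue could be set up roughly as follows. This is only a sketch: the queue names route and test are taken from my own tracejob output further down, the destination queue is assumed to exist already, and your attribute values will differ. In the print queue output for intel, the lines to look for are queue_type = Route and route_destinations.

```shell
# Sketch only: create a routing queue named "route" that forwards jobs to "test".
# Run as a Torque manager/operator; the queue names are illustrative.
qmgr -c "create queue route"
qmgr -c "set queue route queue_type = Route"
qmgr -c "set queue route route_destinations = test"
qmgr -c "set queue route enabled = True"
qmgr -c "set queue route started = True"
```

If intel shows a route_destinations list with debug in it, that is most likely where the hop comes from.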
If it is not a routing queue, you can raise the log level to get a better view of what is happening in the pbs_server logs.
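Something along these lines should do it. Treat the details as assumptions: the level 7 is simply the most verbose setting, and the log path is the common default, which depends on where TORQUE_HOME lives on your system.

```shell
# Raise pbs_server logging verbosity (7 is the most verbose level in Torque).
qmgr -c "set server log_level = 7"

# Assumed default log location; the daily log files are named YYYYMMDD.
tail -n 50 /var/spool/torque/server_logs/"$(date +%Y%m%d)"
```

Remember to lower the level again afterwards, since the logs grow quickly at high verbosity.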
When I create a routing queue this way, I get the same kind of tracejob output when submitting a job:
05/20/2016 20:04:05.439 S enqueuing into route, state 1 hop 1
05/20/2016 20:04:05.440 S dequeuing from route, state QUEUED
05/20/2016 20:04:05.440 S enqueuing into test, state 1 hop 1
05/20/2016 20:04:05.737 S Job Run at request of root@testserver
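To spot this pattern quickly across several jobs, the queue hops can be pulled out of the tracejob output with a small filter. This is a minimal sketch: queue_hops is a hypothetical helper name, and it assumes the "enqueuing into <queue>," wording shown in the logs above.

```shell
# Hypothetical helper: print, in order, every queue a job was enqueued into.
# Reads tracejob output on stdin and matches the "enqueuing into <queue>," lines.
queue_hops() {
  sed -n 's/.*enqueuing into \([^,]*\),.*/\1/p'
}
```

Used as tracejob 15631 | queue_hops, more than one line of output means the job was routed from one queue to another.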
Otherwise, check the Maui configuration and logs for clues.