I am requesting 14 processors on a single node (each node has 32), like this:
#PBS -l nodes=1:ppn=14
#PBS -l walltime=12:00:00
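For reference, the directives above sit in a submission script along these lines (the job name and the final command are placeholders, not my actual workload):

#!/bin/bash
#PBS -N cpuset_test              # placeholder job name
#PBS -l nodes=1:ppn=14           # one node, 14 processors per node
#PBS -l walltime=12:00:00
#PBS -j oe                       # merge stdout and stderr

cd "$PBS_O_WORKDIR"              # run from the directory the job was submitted from
cat "$PBS_NODEFILE"              # show the slots the scheduler assigned
# ./my_program                   # placeholder for the real application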
With a low ppn this almost always works, but as soon as I go above 14 or so, the job starts executing and is immediately killed. tracejob is not very helpful:
tracejob 14753.hpc2
Job: 14753.hpc2
01/21/2017 11:12:36 L Considering job to run
01/21/2017 11:12:36 L Job run
01/21/2017 11:12:36 M Resource_List.place = scatter
01/21/2017 11:12:36 M make_cpuset, vnode hpc2[0]: hv_ncpus (2) > mvi_acpus (0) (you are not expected to understand this)
01/21/2017 11:12:36 M start_exec, new_cpuset failed
01/21/2017 11:12:36 M kill_job
01/21/2017 11:12:36 M hpc2 cput= 0:00:00 mem=0kb
01/21/2017 11:12:37 M Obit sent
01/21/2017 11:12:37 M copy file request received
01/21/2017 11:12:37 M staged 2 items out over 0:00:00
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:38 M no active tasks
01/21/2017 11:12:38 M delete job request received
I have sometimes succeeded in requesting more CPUs, so it is not fully deterministic. Is there any way to debug this?
As a side note, any job that requests more than one node sits in the queue forever and never starts; I don't know whether that is related.
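In case it is useful, this is the kind of information I can gather so far; the log path assumes the default PBS_HOME of /var/spool/pbs, which may differ on this cluster:

# MoM log on the execution host for the day the job died
# (file name is the date; path is an assumption about PBS_HOME)
grep 14753 /var/spool/pbs/mom_logs/20170121

# The server's view of the node and its resources
pbsnodes hpc2

# Full attributes of the job while it is still known to the server
qstat -f 14753.hpc2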
Answer 0 (score: 0)
Have you tried executing "qrun" to forcibly start this job on the specified vnode?
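For example, something along these lines, run as a PBS operator or manager (the job id is the one from your trace; -H names the execution host to force it onto):

# Bypass the scheduler and force the job onto the given host
qrun -H hpc2 14753.hpc2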
If not, can you share the pbsnodes data for vnode hpc2[0]?
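Something like the following, run on the server host, would show whether the vnode actually has CPUs available:

# State and resources of the whole execution host
pbsnodes hpc2
# In PBS Pro the individual vnode can be queried as well; the quotes
# keep the shell from treating the brackets as a glob pattern
pbsnodes -v 'hpc2[0]'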
As a possible solution, try restarting the MoM (pbs_mom) or setting the vnode's sharing attribute to exclusive (you will, of course, need to be a privileged user to do this).
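A rough sketch of those two suggestions; the init-script location, the config-file name, and the exact sharing value are assumptions about a typical PBS Pro installation, so adjust them for your site:

# Restart the MoM on the execution host (as root)
/etc/init.d/pbs restart          # or: systemctl restart pbs

# The "sharing" vnode attribute is normally set through a vnode
# definition file that pbs_mom reads, for example a file containing
#   hpc2[0]: sharing = force_excl
# installed with something like
pbs_mom -s insert excl_vnodes /path/to/vnode_def_file
# and then restart the MoM so the new definition takes effect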