TORQUE jobs hang on Ubuntu 16.04

Time: 2017-10-29 16:57:03

Tags: ubuntu-16.04 torque

I installed TORQUE on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script, test.pbs:

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

I run it with

qsub test.pbs

The job writes done.txt and echoes "done" just fine, but it then hangs in the C state.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost              test             wlandau         00:00:00 C batch    

Edit: some diagnostic information on another job, from qstat -f 55:

qstat -f 55
Job Id: 55.localhost
    Job_Name = test
    Job_Owner = wlandau@localhost
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = haggunenon
    Checkpoint = u
    ctime = Mon Oct 30 07:35:00 2017
    Error_Path = localhost:/home/wlandau/Desktop/test.e55
    exec_host = localhost/2
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 30 07:35:00 2017
    Output_Path = localhost:/home/wlandau/Desktop/test.o55
    Priority = 0
    qtime = Mon Oct 30 07:35:00 2017
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 00:01:00
    session_id = 5115
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
        PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
        PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
        :/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
        mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
        PBS_O_WORKDIR=/home/wlandau/Desktop
    comment = Job started on Mon Oct 30 at 07:35
    etime = Mon Oct 30 07:35:00 2017
    exit_status = 0
    submit_args = test.pbs
    start_time = Mon Oct 30 07:35:00 2017
    Walltime.Remaining = 60
    start_count = 1
    fault_tolerant = False
    comp_time = Mon Oct 30 07:35:00 2017

And similarly, from tracejob -n2 62:

/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located

Job: 62.localhost

10/30/2017 17:20:25  S    enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25  S    Job Queued at request of wlandau@localhost, owner =
                          wlandau@localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
                          queue = batch
10/30/2017 17:20:25  S    Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:20:25  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25  L    Job Run
10/30/2017 17:20:25  S    Job Run at request of Scheduler@Haggunenon
10/30/2017 17:20:25  S    Not sending email: User does not want mail of this type.
10/30/2017 17:20:25  S    Not sending email: User does not want mail of this type.
10/30/2017 17:20:25  M    job was terminated
10/30/2017 17:20:25  M    obit sent to server
10/30/2017 17:20:25  A    queue=batch
10/30/2017 17:20:25  M    scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25  A    user=wlandau group=wlandau
                          jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
                          ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
                          owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
                          Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=01:00:00 
10/30/2017 17:20:25  A    user=wlandau group=wlandau
                          jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
                          ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
                          owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
                          Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
                          end=1509398425 Exit_status=0 resources_used.cput=00:00:00
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:00

Edit: jobs now hang in the E state

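After some tinkering, I am now using these settings. I have moved on to this tiny pipeline workflow, in which some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond the first 4 just sit in the queue. To keep things from hanging indefinitely, I have to run sudo qdel -p on each one, which I think causes legitimate problems for the project's files as well as being inconvenient.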

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost             ...b73ec2cda6dca wlandau         00:00:00 E batch          
114.localhost             ...b6c8e6da05983 wlandau         00:00:00 E batch          
115.localhost             ...9123b8e20850b wlandau         00:00:00 E batch          
116.localhost             ...e6d49a3d7d822 wlandau         00:00:00 E batch          
117.localhost             ...8c3f6cb68927b wlandau                0 Q batch          
118.localhost             ...40b1d0cab6400 wlandau                0 Q batch  
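The per-job cleanup described above could be scripted. The following is only a minimal sketch: jobs_in_state is a hypothetical helper name, and it assumes the default qstat layout shown here, where the state is the second-to-last column. Note that qdel -p purges a job without the normal cleanup, so it remains a workaround rather than a fix.

```shell
# List the IDs of jobs in a given state, reading default `qstat` output on stdin.
# Assumption: job lines start with a numeric job ID and the state letter is the
# second-to-last whitespace-separated column, as in the listing above.
jobs_in_state() {
    awk -v want="$1" '$1 ~ /^[0-9]+/ && $(NF-1) == want { print $1 }'
}

# Usage on a live system (needs TORQUE's qstat and qdel on PATH):
#   qstat | jobs_in_state E | while read -r id; do sudo qdel -p "$id"; done
```

Filtering on the state column first makes it possible to review the list of job IDs before anything is purged.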

qmgr -c "list server" shows

Server haggunenon
        server_state = Active
        scheduling = True
        max_running = 300
        total_jobs = 5
        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3 
        acl_hosts = localhost
        managers = root@localhost
        operators = root@localhost
        default_queue = batch
        log_events = 511
        mail_from = adm
        query_other_jobs = True
        resources_assigned.ncpus = 4
        resources_assigned.nodect = 4
        scheduler_iteration = 600
        node_check_rate = 150
        tcp_timeout = 6
        mom_job_sync = True
        pbs_version = 2.4.16
        keep_completed = 0
        submit_hosts = SERVER
        allow_node_submit = True
        next_job_number = 119
        net_counter = 118 94 93

and qmgr -c "list queue batch" shows

Queue batch
        queue_type = Execution
        total_jobs = 5
        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4 
        max_running = 300
        resources_max.ncpus = 4
        resources_max.nodes = 2
        resources_min.ncpus = 1
        resources_default.ncpus = 1
        resources_default.nodect = 1
        resources_default.nodes = 1
        resources_default.walltime = 01:00:00
        mtime = Wed Nov  1 07:40:45 2017
        resources_assigned.ncpus = 4
        resources_assigned.nodect = 4
        keep_completed = 0
        enabled = True
        started = True

1 answer:

Answer 0 (score: 0)

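The C state means that the job has completed and its status is being kept in the system. Normally, a completed job's status is retained for the period specified by the keep_completed parameter. However, certain kinds of failure can cause a job to remain in this state so that the information needed to examine the cause of the failure stays available.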

Check the output of qstat -f 46 to see whether there is any indication of an error.
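As a rough illustration of that check, the two fields most likely to reveal a failure can be pulled out of the full listing. This is a sketch only: summarize_job is a hypothetical helper name, and it assumes the "name = value" layout of the qstat -f output shown in the question.

```shell
# Print the job_state and exit_status fields from `qstat -f` output on stdin.
# Assumption: each attribute appears on its own line as "name = value".
summarize_job() {
    awk -F' = ' '
        /^[ \t]*job_state = /   { state = $2 }
        /^[ \t]*exit_status = / { status = $2 }
        END {
            # exit_status is absent until the job has finished
            printf "state=%s exit_status=%s\n", state, (status == "" ? "unknown" : status)
        }
    '
}

# Usage on a live system (needs TORQUE's qstat on PATH):
#   qstat -f 46 | summarize_job
```

A nonzero exit_status would point to a failure in the job itself, while an exit_status of 0 with state C suggests the job simply finished.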

To adjust the keep_completed parameter, first check its current value on your system with the following command.

qmgr -c "print queue batch keep_completed"

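If you have administrative rights on the Torque server, you can also change this value with

qmgr -c "set queue batch keep_completed=120"

which keeps jobs in the completed state for 2 minutes after they finish.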
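Because keep_completed is given in seconds, it can be convenient to compute the value from minutes. The following is a minimal sketch; keep_completed_directive is a hypothetical helper name, and the queue name batch is taken from the question.

```shell
# Build the qmgr directive that keeps completed jobs visible for a number of minutes.
# keep_completed is expressed in seconds, so the minutes are multiplied by 60.
keep_completed_directive() {
    queue=$1
    minutes=$2
    printf 'set queue %s keep_completed=%d\n' "$queue" "$((minutes * 60))"
}

# Usage on a live system (needs manager rights on the Torque server):
#   qmgr -c "$(keep_completed_directive batch 2)"
```

For 2 minutes on the batch queue, this produces the same directive as the command above.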

In general, having keep_completed set is a useful feature. Advanced schedulers use the information on completed jobs to schedule around failures.