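I'm using IBM LSF and trying to collect usage statistics over a period of time. I found that bhist does the job, but the short-format bhist output doesn't show all the fields I need.
What I'd like to know is:
1. Are the output fields of bhist customizable? The fields I need are:
2. If 1 isn't possible, the long-format (bhist -l) output shows everything I need, but the format is hard to work with. I've pasted a sample of the format below. For example, the number of lines between records isn't fixed, and the line wrapping within each event can break a line in the middle of a word I'm trying to scan for. How can I parse this format with sed and awk?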
JobId <1531>, User <user1>, Project <default>, Command< example200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$H
OME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD
</home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after being resumed
by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.
Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
5 0 205 7 1 0 218
------------------------------------------------------------
.... repeat
Answer 0 (score: 0)
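The long-format output is hard to parse. I know that bjobs in older LSF versions has an option for unformatted output (-UF) that makes this somewhat easier, and the latest versions of LSF let you choose which columns are printed in the short-format output with -o.
Unfortunately, neither option is available for bhist. The only real possibilities for historical information are:
1. Parse the output of bhist -l. Because of the formatting inconsistencies you found, this is impractical and possibly not even feasible.
2. Use the LSF API functions that bhist itself uses to parse the lsb.events file. This is the file that stores all the historical information about the LSF cluster, and it is what bhist reads to generate its output.
3. Parse the lsb.events file yourself. Its format is documented in the configuration reference. This is hard, but not impossible. Here is the relevant document for LSF 9.1.3.
My personal recommendation is #2. The function you are looking for is lsb_geteventrec(). You basically read lsb.events one line at a time and extract the information you need.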
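If a rough parse of bhist -l is still good enough for simple statistics, the following is a minimal Python sketch of one way to approach it (Python rather than sed and awk, purely for readability). It assumes records are separated by a line of dashes, that every event starts with a timestamp such as "Fri Dec 27 13:04:14:", and that any other line is a wrapped continuation. The names parse_bhist_long and parse_record are made up for this example, and the sample is an abridged copy of the output in the question.

```python
import re

# Abridged copy of the bhist -l record from the question.
SAMPLE = """\
JobId <1531>, User <user1>, Project <default>, Command< example200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$H
OME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.
Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
5 0 205 7 1 0 218
------------------------------------------------------------
"""

# Assumption: a line made only of dashes ends each record.
SEPARATOR = re.compile(r"^-{10,}[ \t]*$", re.MULTILINE)
# Assumption: events begin with "Day Mon DD HH:MM:SS:".
EVENT_START = re.compile(
    r"^[A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d{1,2}\s+\d\d:\d\d:\d\d:"
)
# "Name <value>" pairs such as "User <user1>".
FIELD = re.compile(r"(\w[\w ]*?)\s*<\s*([^>]*)>")


def parse_record(record):
    """Turn one record into header fields, unwrapped events and state times."""
    lines = [ln.strip() for ln in record.splitlines() if ln.strip()]
    header, events, states = "", [], {}
    section = "header"
    for idx, line in enumerate(lines):
        if line.startswith("PEND") and idx + 1 < len(lines):
            # State names on one line, seconds spent in each on the next.
            states = dict(zip(line.split(), map(int, lines[idx + 1].split())))
            break
        if line.startswith("Summary of time"):
            section = "summary"
        elif EVENT_START.match(line):
            section = "events"
            events.append(line)
        elif section == "events":
            # Wrapped continuation: glue it on without a space so that
            # values split mid-word (e.g. <$H / OME>) are restored.
            events[-1] += line
        elif section == "header":
            header += line
    return {"fields": dict(FIELD.findall(header)),
            "events": events, "states": states}


def parse_bhist_long(text):
    """Split the text on separator lines and parse every non-empty record."""
    return [parse_record(chunk) for chunk in SEPARATOR.split(text)
            if chunk.strip()]


for job in parse_bhist_long(SAMPLE):
    print(job["fields"]["JobId"], job["fields"]["User"], job["states"])
```

Joining continuation lines without a space keeps bracketed values intact, but it can also glue two ordinary words together when the wrap happened at a space, so the free text of an event is not fully reliable. That ambiguity is the kind of inconsistency that makes this route fragile.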
Answer 1 (score: 0)
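I'm adding a second answer because it may solve your problem without you having to write your own solution (depending on which usage statistics you're after).
LSF already has a utility called bacct that computes and prints various usage statistics about historical LSF jobs, filtered by various criteria.
For example, to get summary usage statistics about jobs that were dispatched/completed/submitted between time0 and time1, you can use, respectively: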
bacct -D time0,time1
bacct -C time0,time1
bacct -S time0,time1
Statistics about jobs submitted by a particular user:
bacct -u <username>
Statistics about jobs submitted to a particular queue:
bacct -q <queuename>
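These options can also be combined. For example, if you want statistics about jobs that were submitted and completed within a particular time window for a particular project, you can use: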
bacct -S time0,time1 -C time0,time1 -P <projectname>
The output gives summary information for all jobs matching the given criteria, like this:
$ bacct -u bobbafett -q normal
Accounting information about jobs that are:
- submitted by users bobbafett,
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to queues normal,
- accounted on all service classes.
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
Total number of done jobs: 0 Total number of exited jobs: 32
Total CPU time consumed: 46.8 Average CPU time consumed: 1.5
Maximum CPU time of a job: 9.0 Minimum CPU time of a job: 0.0
Total wait time in queues: 18680.0
Average wait time in queue: 583.8
Maximum wait time in queue: 5507.0 Minimum wait time in queue: 0.0
Average turnaround time: 11568 (seconds/job)
Maximum turnaround time: 43294 Minimum turnaround time: 40
Average hog factor of a job: 0.00 ( cpu time / turnaround time )
Maximum hog factor of a job: 0.02 Minimum hog factor of a job: 0.00
Total Run time consumed: 351504 Average Run time consumed: 10984
Maximum Run time of a job: 1844674 Minimum Run time of a job: 0
Total throughput: 0.24 (jobs/hour) during 160.32 hours
Beginning time: Nov 11 17:55 Ending time: Nov 18 10:14
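The summary block is regular enough to scrape if you want the numbers in a script. Here is a small Python sketch, assuming every statistic appears as "Label: number" and that a line may hold more than one such pair. The function name parse_bacct_summary is made up for this example, and the sample is an abridged copy of the output above.

```python
import re

# Abridged copy of the bacct summary shown above.
SAMPLE = """\
SUMMARY: ( time unit: second )
Total number of done jobs: 0 Total number of exited jobs: 32
Total CPU time consumed: 46.8 Average CPU time consumed: 1.5
Total wait time in queues: 18680.0
Average wait time in queue: 583.8
Total throughput: 0.24 (jobs/hour) during 160.32 hours
Beginning time: Nov 11 17:55 Ending time: Nov 18 10:14
"""

# A label starting with a capital letter, a colon, then a plain number.
PAIR = re.compile(r"([A-Z][A-Za-z ]*?):[ \t]+(-?\d+(?:\.\d+)?)")


def parse_bacct_summary(text):
    """Return {label: value} for every numeric 'Label: number' pair."""
    return {label: float(value) for label, value in PAIR.findall(text)}


stats = parse_bacct_summary(SAMPLE)
for label, value in stats.items():
    print(f"{label} = {value}")
```

Lines whose values are not plain numbers, such as the beginning and ending times, are skipped by this pattern and would need their own handling.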
This command also has a long-format output that gives some bhist -l-like information for each job, and it may be easier to parse (though still not that easy):
$ bacct -l -u bobbafett -q normal
Accounting information about jobs that are:
- submitted by users bobbafett,
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to queues normal,
- accounted on all service classes.
------------------------------------------------------------------------------
Job <101>, User <bobbafett>, Project <default>, Status <EXIT>, Queue <normal>,
Command <sleep 100000000>
Wed Nov 11 17:37:45: Submitted from host <endor>, CWD <$HOME>;
Wed Nov 11 17:55:05: Completed <exit>; TERM_OWNER: job killed by owner.
Accounting information about this job:
CPU_T WAIT TURNAROUND STATUS HOG_FACTOR MEM SWAP
0.00 1040 1040 exit 0.0000 0M 0M
------------------------------------------------------------------------------
...
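As a rough illustration of how that long output could be flattened into one row per job with only the columns you care about, here is a Python sketch. It assumes each job block sits between lines of dashes, starts with "Job <id>, User <name>", and contains a "CPU_T ..." header line directly followed by its values. The names parse_bacct_long and to_csv are made up for this example.

```python
import csv
import io
import re

# Copy of the per-job block from the bacct -l output above.
SAMPLE = """\
------------------------------------------------------------------------------
Job <101>, User <bobbafett>, Project <default>, Status <EXIT>, Queue <normal>,
Command <sleep 100000000>
Wed Nov 11 17:37:45: Submitted from host <endor>, CWD <$HOME>;
Wed Nov 11 17:55:05: Completed <exit>; TERM_OWNER: job killed by owner.
Accounting information about this job:
CPU_T WAIT TURNAROUND STATUS HOG_FACTOR MEM SWAP
0.00 1040 1040 exit 0.0000 0M 0M
------------------------------------------------------------------------------
"""


def parse_bacct_long(text):
    """Return one dict per job block; all values are kept as strings."""
    jobs = []
    for chunk in re.split(r"(?m)^-{10,}[ \t]*$", text):
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        match = re.match(r"Job <([^>]+)>, User <([^>]+)>", lines[0]) if lines else None
        if not match:
            continue  # e.g. the block describing the filter criteria
        row = {"job_id": match.group(1), "user": match.group(2)}
        for idx, line in enumerate(lines[:-1]):
            if line.startswith("CPU_T"):
                # Column names on this line, their values on the next one.
                row.update(zip(line.split(), lines[idx + 1].split()))
                break
        jobs.append(row)
    return jobs


def to_csv(jobs, columns):
    """Write only the requested columns, one CSV row per job."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


print(to_csv(parse_bacct_long(SAMPLE), ["job_id", "user", "STATUS", "TURNAROUND"]))
```

Values are left as strings because columns such as MEM and SWAP carry unit suffixes; convert them as needed for your statistics.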