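I found a post (here) explaining how to tell bsub to wait for a specified set of jobs to finish before running, but that only works if you know the number of jobs in advance.
I want to run an arbitrary number of jobs, and then run a "wrap-up" job once all of them have finished.
Here is my script: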
#!/bin/bash
for file in dir/*; do # I don't know how many jobs will be created
bsub "./do_it_once.sh $file"
done
bsub -w "done(1) && done(2) && done(3)" merge_results.sh
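The script above works when 3 jobs are submitted, but what if there are n jobs? How can I specify that I want to wait for all of them to finish?
Answer 0 (score: 1)
Edit: see kamula's answer for what actually works :).
I have never used bsub, but from a quick look at the man page, I think this might do it: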
#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
bsub -J "myjobs[$jobnum]" "./do_it_once.sh $file"
jobnum=$((jobnum + 1))
done
bsub -w "done(myjobs[*])" merge_results.sh
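This names the jobs with sequential indices in a bsub array called myjobs[], using the bash variable jobnum. The final bsub then waits for all of the myjobs[] jobs to finish.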
YMMV!
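Oh, and you might need to use -J "\"myjobs[...]\"" (with \"). The man page says to put double quotes around the job name, but I don't know whether that is a bsub requirement or whether they assume you will be using a shell that expands unquoted text.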
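For comparison, here is a rough sketch of a real LSF job array, which is submitted with a single bsub call instead of one call per file. It is not from the answer above. It assumes that LSF sets LSB_JOBINDEX for each array element, that jobs start in the submission directory on a shared filesystem, and that a dependency on the array name covers every element; please check these against your site's LSF documentation. The file name file_list.txt and the helper function names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: one LSF job array instead of one bsub call per file.
# file_list.txt and the helper names below are hypothetical.

LIST=file_list.txt

# Write one input path per line so array element i can look up line i.
write_file_list() {
    local dir="$1" list="$2" file
    : > "$list"
    for file in "$dir"/*; do
        if [ -e "$file" ]; then
            printf '%s\n' "$file" >> "$list"
        fi
    done
}

# Print the path stored on line $2 of list file $1.
nth_file() {
    sed -n "${2}p" "$1"
}

# Build the array specification, e.g. myjobs[1-3].
array_spec() {
    printf '%s[1-%d]' "$1" "$2"
}

submit_array() {
    local dir="$1" count
    write_file_list "$dir" "$LIST"
    count=$(( $(wc -l < "$LIST") ))
    if [ "$count" -eq 0 ]; then
        echo "no input files in $dir" >&2
        return 1
    fi
    # Single quotes: LSB_JOBINDEX must expand on the execution host, not here.
    bsub -J "$(array_spec myjobs "$count")" \
        'file=$(sed -n "${LSB_JOBINDEX}p" file_list.txt); ./do_it_once.sh "$file"'
    # Assumption: a dependency on the array name waits for all elements.
    bsub -w "done(myjobs)" ./merge_results.sh
}

# Only submit when LSF is actually installed.
if command -v bsub > /dev/null 2>&1; then
    submit_array src
fi
```

The list file is what lets each array element find its own input, since an array element only knows its index.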
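Answer 1 (score: 1)
Based on cxw's reply, I got something to work. It does not use arrays. However, the -w option accepts wildcards, so I gave every job a similar name.
I am still not sure this is the proper way to invoke bsub, since it has to be called once per job, but it works.
Edit from cxw: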
#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
bsub -J "myjobs${jobnum}" "./do_it_once.sh $file"
jobnum=$((jobnum + 1))
done
bsub -w "done(myjobs*)" merge_results.sh
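If you would rather not rely on job names, another option (not from the answers here) is to collect the job IDs as you submit and build the done(...) && done(...) expression from the question for however many jobs there are. The sketch below assumes bsub prints a line of the form "Job <1234> is submitted to ...", which is what I have seen on LSF, and the function names are my own. With a very large number of jobs the expression gets long, so the wildcard approach may be more practical there.

```shell
#!/bin/bash
# Sketch: build "done(id1) && done(id2) && ..." for an arbitrary number of jobs.

# Extract the numeric ID from a line like:
#   Job <1234> is submitted to queue <normal>.
job_id_from_output() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' <<< "$1"
}

# Join all given job IDs into a dependency expression.
build_dependency() {
    local expr="" id
    for id in "$@"; do
        if [ -n "$expr" ]; then
            expr+=" && "
        fi
        expr+="done($id)"
    done
    printf '%s' "$expr"
}

# Only submit when LSF is actually installed.
if command -v bsub > /dev/null 2>&1; then
    job_ids=()
    for file in src/*; do
        out=$(bsub "./do_it_once.sh $file")
        id=$(job_id_from_output "$out")
        if [ -n "$id" ]; then
            job_ids+=("$id")
        fi
    done
    if [ "${#job_ids[@]}" -gt 0 ]; then
        bsub -w "$(build_dependency "${job_ids[@]}")" ./merge_results.sh
    fi
fi
```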
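Answer 2 (score: 0)
Here is my full solution. It adds time control and reports the number of failed jobs. If needed, it also takes care of killing the children of failed jobs and handles zombie or uninterruptible processes: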
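function Logger {
echo "$1"
}
# The helpers below are called by WaitForTaskCompletion but were not included in this answer.
# These minimal stand-ins are assumptions so that the snippet runs on its own.
function Spinner { :; }
function SendAlert { :; }
function joinString { local IFS="$1"; shift; echo "$*"; }
function IsNumeric { [[ "$1" =~ ^[0-9]+$ ]] && echo 1 || echo 0; }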
# Portable child (and grandchild) kill function tested under Linux, BSD and MacOS X
function KillChilds {
local pid="${1}" # Parent pid to kill childs
local self="${2:-false}" # Should parent be killed too ?
local children child
if children="$(pgrep -P "$pid")"; then
for child in $children; do
KillChilds "$child" true
done
fi
# Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
if ( [ "$self" == true ] && kill -0 $pid > /dev/null 2>&1); then
kill -s TERM "$pid"
if [ $? != 0 ]; then
sleep 15
Logger "Sending SIGTERM to process [$pid] failed."
kill -9 "$pid"
if [ $? != 0 ]; then
Logger "Sending SIGKILL to process [$pid] failed."
return 1
fi
else
return 0
fi
else
return 0
fi
}
function WaitForTaskCompletion {
local pids="${1}" # pids to wait for, separated by semi-colon
local soft_max_time="${2}" # If program with pid $pid takes longer than $soft_max_time seconds, will log a warning, unless $soft_max_time equals 0.
local hard_max_time="${3}" # If program with pid $pid takes longer than $hard_max_time seconds, will stop execution, unless $hard_max_time equals 0.
local caller_name="${4}" # Who called this function
local counting="${5:-true}" # Count time since function has been launched if true, since script has been launched if false
local keep_logging="${6:-0}" # Log a standby message every X seconds. Set to zero to disable logging
local soft_alert=false # Does a soft alert need to be triggered, if yes, send an alert once
local log_ttime=0 # local time instance for comparison
local seconds_begin=$SECONDS # Seconds since the beginning of the script
local exec_time=0 # Seconds since the beginning of this function
local retval=0 # return value of monitored pid process
local errorcount=0 # Number of pids that finished with errors
local pid # Current pid working on
local pidCount # number of given pids
local pidState # State of the process
local pidsArray # Array of currently running pids
local newPidsArray # New array of currently running pids
IFS=';' read -a pidsArray <<< "$pids"
pidCount=${#pidsArray[@]}
WAIT_FOR_TASK_COMPLETION=""
while [ ${#pidsArray[@]} -gt 0 ]; do
newPidsArray=()
Spinner
if [ $counting == true ]; then
exec_time=$(($SECONDS - $seconds_begin))
else
exec_time=$SECONDS
fi
if [ $keep_logging -ne 0 ]; then
if [ $((($exec_time + 1) % $keep_logging)) -eq 0 ]; then
if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1s
log_ttime=$exec_time
Logger "Current tasks still running with pids [${pidsArray[*]}]." "NOTICE"
fi
fi
fi
if [ $exec_time -gt $soft_max_time ]; then
if [ $soft_alert == false ] && [ $soft_max_time -ne 0 ]; then
Logger "Max soft execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]."
soft_alert=true
SendAlert true
fi
if [ $exec_time -gt $hard_max_time ] && [ $hard_max_time -ne 0 ]; then
Logger "Max hard execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution."
for pid in "${pidsArray[@]}"; do
KillChilds $pid true
if [ $? == 0 ]; then
Logger "Task with pid [$pid] stopped successfully." "NOTICE"
else
Logger "Could not stop task with pid [$pid]." "ERROR"
fi
done
SendAlert true
errorcount=$((errorcount+1))
fi
fi
for pid in "${pidsArray[@]}"; do
if [ $(IsNumeric $pid) -eq 1 ]; then
if kill -0 $pid > /dev/null 2>&1; then
# Handle uninterruptible sleep state or zombies by omitting them from running process array (How to kill that is already dead ? :)
#TODO(high): have this tested on *BSD, Mac & Win
pidState=$(ps -p$pid -o state= 2> /dev/null)
if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
newPidsArray+=($pid)
fi
else
# pid is dead, get its exit code from wait command
wait $pid
retval=$?
if [ $retval -ne 0 ]; then
errorcount=$((errorcount+1))
Logger "${FUNCNAME[0]} called by [$caller_name] finished monitoring [$pid] with exitcode [$retval]." "DEBUG"
if [ "$WAIT_FOR_TASK_COMPLETION" == "" ]; then
WAIT_FOR_TASK_COMPLETION="$pid:$retval"
else
WAIT_FOR_TASK_COMPLETION="$WAIT_FOR_TASK_COMPLETION;$pid:$retval"
fi
fi
fi
fi
done
pidsArray=("${newPidsArray[@]}")
# Trivial wait time for bash to not eat up all CPU
sleep .05
done
# Return exit code if only one process was monitored, else return number of errors
if [ $pidCount -eq 1 ]; then
return $retval
else
return $errorcount
fi
}
Usage:
Let's start 3 sleep jobs, collect their pids and pass them to WaitForTaskCompletion:
sleep 10 &
pids="$!"
sleep 15 &
pids="$pids;$!"
sleep 20 &
pids="$pids;$!"
WaitForTaskCompletion "$pids" 1800 3600 "${FUNCNAME[0]}" false 1800
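The example above logs a warning if execution takes longer than 30 minutes (1800 seconds), stops execution after 1 hour (3600 seconds), and logs an "alive" message every half hour.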
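If you do not need the timeouts, the core idea (wait on each pid and count the failures) fits in a few lines. This is a stripped-down sketch, not a replacement for the function above, and wait_all is a name I made up. Like the full version, it only works for pids that are children of the current shell, because it relies on the wait builtin.

```shell
#!/bin/bash
# Sketch: wait for every given pid and return how many exited non-zero.
# Note: return codes wrap above 255, so this suits a modest number of jobs.
wait_all() {
    local failed=0 pid
    for pid in "$@"; do
        if ! wait "$pid"; then
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

pids=()
sleep 0.2 &
pids+=("$!")
false &
pids+=("$!")
(exit 3) &
pids+=("$!")

failed=0
wait_all "${pids[@]}" || failed=$?
echo "failed jobs: $failed"    # prints: failed jobs: 2
```

Do not call wait_all inside $(...): a command substitution runs in a subshell, and the background jobs are not children of that subshell.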
Answer 3 (score: 0)
The output of bjobs is 1 line when no jobs are pending or running (No unfinished job found), and at least 2 lines when at least 1 job is pending or running:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
25156 awesome RUN best_queue superhost 30*host cool_name Jun 16 05:38
So you can loop on the output of bjobs | wc -l like this:
for job in $some_jobs; do
bsub < $job
done
# Waiting for jobs to complete
while [[ `bjobs | wc -l` -ge 2 ]] ; do
sleep 15
done
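One benefit of this technique is that you can launch several jobs before waiting, no matter how many you need to run: just loop over them before the wait. Obviously this is not the cleanest approach, but it works for now. The same idea applies to groups of jobs: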
for some_jobs in $job_groups; do \
for job in $some_jobs; do \
bsub < $job
done
# Waiting for jobs to complete
while [[ `bjobs | wc -l` -ge 2 ]] ; do \
sleep 15
done
done
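The polling loop can also be wrapped in a function with an optional timeout, so a stuck job does not block the script forever. This is only a sketch: wait_for_lsf_jobs and its two parameters are names I made up, and it keeps the same "fewer than 2 lines of bjobs output means nothing is left" assumption as above. Keep in mind that bjobs lists all of your unfinished jobs, not only the ones this script submitted.

```shell
#!/bin/bash
# Sketch: poll bjobs until it prints fewer than 2 lines, with an optional timeout.
#   $1 = seconds between polls (default 15)
#   $2 = maximum seconds to wait, 0 means wait forever (default 0)
# Returns 0 when no unfinished jobs remain, 1 on timeout.
wait_for_lsf_jobs() {
    local poll_interval="${1:-15}" max_wait="${2:-0}"
    local start=$SECONDS lines
    while true; do
        lines=$(bjobs 2> /dev/null | wc -l)
        if (( lines < 2 )); then
            return 0
        fi
        if (( max_wait != 0 && SECONDS - start >= max_wait )); then
            return 1
        fi
        sleep "$poll_interval"
    done
}
```

For example, wait_for_lsf_jobs 15 3600 || echo "still running after one hour" polls every 15 seconds and gives up after an hour.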