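Run a job only after all of my previous jobs have finished

Time: 2016-08-30 14:29:17

Tags: bash lsf

I found a post explaining how to tell bsub to wait for a specified set of jobs to finish before running, but that only works if you know the number of jobs in advance.

I want to run an arbitrary number of jobs, and then run a "wrap-up" job once all of them have finished.

Here is my script: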

#!/bin/bash
for file in dir/*; do # I don't know how many jobs will be created
    bsub "./do_it_once.sh $file"
done

bsub -w "done(1) && done(2) && done(3)" merge_results.sh

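The script above works when exactly 3 jobs are submitted, but what if there are n jobs? How do I specify that I want to wait for all of the jobs to finish?

4 answers:

Answer 0 (score: 1)

Edit: see kamula's answer for what actually works :).

Original answer

I have never used bsub, but from a quick look at the man page, I think this might do it: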

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs[$jobnum]" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs[*])" merge_results.sh

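This names the jobs with sequential indices in a bsub array called myjobs[], using the bash variable jobnum. The final bsub then waits for all of the myjobs[] jobs to finish.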

YMMV!

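Oh, also: you may need to use -J "\"myjobs[...]\"" (with \"). The man page says to enclose the job name in double quotes, but I don't know whether that is a bsub requirement or whether they assume you are using a shell that expands unquoted text.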
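The difference between the two quoting styles can be checked without LSF. The sketch below uses a hypothetical show_arg helper (not part of bsub) that prints its first argument between angle brackets. It only demonstrates how bash handles the quotes, not what bsub itself requires.

```shell
#!/bin/bash
# show_arg is a hypothetical helper: it prints the argument exactly as a
# command such as bsub would receive it after bash has processed the quotes.
show_arg() {
    printf '<%s>\n' "$1"
}

jobnum=3
show_arg "myjobs[$jobnum]"        # prints <myjobs[3]>
show_arg "\"myjobs[$jobnum]\""    # prints <"myjobs[3]">
```

Plain double quotes are consumed by bash and mainly protect the brackets from filename expansion. The escaped form passes literal quote characters through to the command.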

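Answer 1 (score: 1)

Building on cxw's reply, I got something working. It does not use arrays. However, the -w option accepts wildcards, so I gave every job a similar name.

I am still not sure this is the proper way to call bsub, since it has to be invoked once per job, but it works.

Edit from cxw: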

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs${jobnum}" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs*)" merge_results.sh
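One possible caveat with the wildcard approach is that done(myjobs*) may also match jobs from an earlier run that share the same prefix. The following is a minimal sketch of one way to reduce that risk, assuming a run-specific prefix is acceptable. The names submit_all and BSUB_CMD are hypothetical and introduced only for illustration. Setting BSUB_CMD to "echo bsub" gives a dry run that prints the commands instead of submitting them.

```shell
#!/bin/bash
# Sketch: submit one job per file under a shared, run-specific name prefix,
# then submit a merge job that waits on every job matching that prefix.
# BSUB_CMD is a hypothetical knob: set it to "echo bsub" for a dry run.
BSUB_CMD=${BSUB_CMD:-bsub}

submit_all() {
    local dir=$1      # directory holding the input files
    local prefix=$2   # job-name prefix, ideally unique per run
    local jobnum=0
    local file
    for file in "$dir"/*; do
        [ -e "$file" ] || continue   # skip the literal pattern if dir is empty
        $BSUB_CMD -J "${prefix}${jobnum}" "./do_it_once.sh $file"
        jobnum=$((jobnum + 1))
    done
    # Only submit the merge job if at least one worker job was submitted
    if [ "$jobnum" -gt 0 ]; then
        $BSUB_CMD -w "done(${prefix}*)" ./merge_results.sh
    fi
}
```

A call such as submit_all src "run$$_" would use the shell's process ID to make the prefix differ between runs.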

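Answer 2 (score: 0)

Here is my full solution, which adds time limits and reports the number of failed jobs. If needed, it also takes care of killing the children of failed jobs, and it handles zombie or uninterruptible processes: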

function Logger {
    echo "$1"
}

# Portable child (and grandchild) kill function, tested under Linux, BSD and macOS
function KillChilds {
    local pid="${1}" # Parent pid whose children should be killed
    local self="${2:-false}" # Should the parent be killed too?
    local children
    local child

    # Recursively kill each child (and its own children) first
    if children="$(pgrep -P "$pid")"; then
        for child in $children; do
            KillChilds "$child" true
        done
    fi
    # Try to kill nicely; if that fails, wait 15 seconds to let trap actions happen before killing
    if ( [ "$self" == true ] && kill -0 $pid > /dev/null 2>&1); then
        kill -s TERM "$pid"
        if [ $? != 0 ]; then
            sleep 15
            Logger "Sending SIGTERM to process [$pid] failed."
            kill -9 "$pid"
            if [ $? != 0 ]; then
                Logger "Sending SIGKILL to process [$pid] failed."
                return 1
            fi
        else
            return 0
        fi
    else
        return 0
    fi
}

function WaitForTaskCompletion {
    local pids="${1}" # pids to wait for, separated by semi-colon
    local soft_max_time="${2}" # If program with pid $pid takes longer than $soft_max_time seconds, will log a warning, unless $soft_max_time equals 0.
    local hard_max_time="${3}" # If program with pid $pid takes longer than $hard_max_time seconds, will stop execution, unless $hard_max_time equals 0.
    local caller_name="${4}" # Who called this function
    local counting="${5:-true}" # Count time since function has been launched if true, since script has been launched if false
    local keep_logging="${6:-0}" # Log a standby message every X seconds. Set to zero to disable logging

    local soft_alert=false # Does a soft alert need to be triggered, if yes, send an alert once
    local log_ttime=0 # local time instance for comparison

    local seconds_begin=$SECONDS # Seconds since the beginning of the script
    local exec_time=0 # Seconds since the beginning of this function

    local retval=0 # return value of monitored pid process
    local errorcount=0 # Number of pids that finished with errors

    local pid   # Current pid working on
    local pidCount # number of given pids
    local pidState # State of the process

    local pidsArray # Array of currently running pids
    local newPidsArray # New array of currently running pids

    IFS=';' read -a pidsArray <<< "$pids"
    pidCount=${#pidsArray[@]}

    WAIT_FOR_TASK_COMPLETION=""

    while [ ${#pidsArray[@]} -gt 0 ]; do
        newPidsArray=()

        Spinner
        if [ $counting == true ]; then
            exec_time=$(($SECONDS - $seconds_begin))
        else
            exec_time=$SECONDS
        fi

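        if [ $keep_logging -ne 0 ]; then
            if [ $((($exec_time + 1) % $keep_logging)) -eq 0 ]; then
                if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1s
                    log_ttime=$exec_time
                    # Standby message (the wording here is illustrative)
                    Logger "Current tasks still running with pids [${pidsArray[*]}]."
                fi
            fi
        fi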

        if [ $exec_time -gt $soft_max_time ]; then
            # Send the soft alert only once: soft_alert starts out false
            if [ $soft_alert == false ] && [ $soft_max_time -ne 0 ]; then
                Logger "Max soft execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]."
                soft_alert=true
                SendAlert true
            fi
            if [ $exec_time -gt $hard_max_time ] && [ $hard_max_time -ne 0 ]; then
                Logger "Max hard execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution."
                for pid in "${pidsArray[@]}"; do
                    KillChilds $pid true
                    if [ $? == 0 ]; then
                        Logger "Task with pid [$pid] stopped successfully." "NOTICE"
                    else
                        Logger "Could not stop task with pid [$pid]." "ERROR"
                    fi
                done
                SendAlert true
                errorcount=$((errorcount+1))
            fi
        fi

        for pid in "${pidsArray[@]}"; do
            if [ $(IsNumeric $pid) -eq 1 ]; then
                if kill -0 $pid > /dev/null 2>&1; then
                    # Handle uninterruptible sleep state or zombies by omitting them from the running process array (how do you kill what is already dead? :)
                    #TODO(high): have this tested on *BSD, Mac & Win
                    pidState=$(ps -p$pid -o state= 2> /dev/null)
                    if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
                        newPidsArray+=($pid)
                    fi
                else
                    # pid is dead, get its exit code from the wait command
                    wait $pid
                    retval=$?
                    if [ $retval -ne 0 ]; then
                        errorcount=$((errorcount+1))
                        Logger "${FUNCNAME[0]} called by [$caller_name] finished monitoring [$pid] with exitcode [$retval]." "DEBUG"
                        # Accumulate pid:exitcode pairs, separated by semicolons
                        if [ "$WAIT_FOR_TASK_COMPLETION" == "" ]; then
                            WAIT_FOR_TASK_COMPLETION="$pid:$retval"
                        else
                            WAIT_FOR_TASK_COMPLETION="$WAIT_FOR_TASK_COMPLETION;$pid:$retval"
                        fi
                    fi
                fi

            fi
        done

        pidsArray=("${newPidsArray[@]}")
        # Trivial wait time for bash to not eat up all CPU
        sleep .05
    done

    # Return the number of pids that finished with errors (0 means all succeeded)
    return $errorcount
}
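
WaitForTaskCompletion calls four helpers that this answer does not define: Spinner, IsNumeric, joinString and SendAlert. The stand-ins below are assumptions written only so the function can be run as posted. They are not the author's versions, and a real SendAlert would presumably do more than print a line.

```shell
#!/bin/bash
# Hypothetical stand-ins for helpers that WaitForTaskCompletion calls but the
# answer does not define. These are assumptions, not the author's versions.

# No-op progress indicator.
Spinner() {
    :
}

# Print 1 if the argument is a non-negative integer, 0 otherwise.
IsNumeric() {
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        echo 1
    else
        echo 0
    fi
}

# Join the remaining arguments, using the first argument as the separator.
joinString() {
    local IFS="$1"
    shift
    echo "$*"
}

# Placeholder alert hook: just write a line to stderr.
SendAlert() {
    echo "ALERT: runtime limit exceeded" >&2
}
```

These definitions need to appear before the first call to WaitForTaskCompletion.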

Usage:

Let's start 3 sleep jobs, collect their pids, and pass them to WaitForTaskCompletion:

sleep 10 &
pids="$!"
sleep 15 &
pids="$pids;$!"
sleep 20 &
pids="$pids;$!"

WaitForTaskCompletion "$pids" 1800 3600 "${FUNCNAME[0]:-main}" false 1800

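The example above logs a warning if execution takes longer than 30 minutes (1800 seconds), stops execution after 1 hour (3600 seconds), and logs an "alive" message every 30 minutes.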
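
The function records failures in the global WAIT_FOR_TASK_COMPLETION as pid:exitcode pairs separated by semicolons. Below is a small sketch of how that string could be turned into readable lines. The report_failures name is hypothetical and not part of the answer.

```shell
#!/bin/bash
# report_failures is a hypothetical helper that prints one line per
# "pid:exitcode" pair found in a semicolon-separated string.
report_failures() {
    local results=$1
    local entry
    local IFS=';'                      # split the string on semicolons
    for entry in $results; do
        [ -n "$entry" ] || continue    # ignore empty fields
        echo "pid ${entry%%:*} exited with code ${entry##*:}"
    done
}
```

After a run, report_failures "$WAIT_FOR_TASK_COMPLETION" would list each failed pid with its exit code, or print nothing if every task succeeded.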

Answer 3 (score: 0)

The output of bjobs is 1 line when no jobs are pending or running (No unfinished job found), and at least 2 lines when at least 1 job is pending or running:

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
25156   awesome RUN   best_queue superhost   30*host     cool_name  Jun 16 05:38

So you can loop on bjobs | wc -l like this:

for job in $some_jobs; do
    bsub < "$job"

    # Wait for the job to complete
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done

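One benefit of this technique is that you can launch several jobs at once, no matter how many you need to run: just loop over them before waiting. Obviously this is not the cleanest approach, but it works for now.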

for some_jobs in $job_groups; do
    for job in $some_jobs; do
        bsub < "$job"
    done

    # Wait for the whole group to complete
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done
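
The polling loop can be wrapped in a function with a timeout, so that a stuck job does not block the script forever. This is only a sketch under the same assumption as above, namely that two or more lines of bjobs output mean something is still pending or running. The names wait_for_jobs and JOB_LIST_CMD are hypothetical. JOB_LIST_CMD exists so the loop can be tried without LSF.

```shell
#!/bin/bash
# Sketch: poll until no unfinished jobs remain, giving up after a timeout.
# Assumption: two or more lines of output mean at least one job is still
# pending or running, as in the sample bjobs output above.
# JOB_LIST_CMD is a hypothetical override used for dry runs and testing.
JOB_LIST_CMD=${JOB_LIST_CMD:-bjobs}

wait_for_jobs() {
    local interval=${1:-15}   # seconds between polls
    local timeout=${2:-0}     # 0 means wait forever
    local start=$SECONDS
    while [[ $($JOB_LIST_CMD 2>/dev/null | wc -l) -ge 2 ]]; do
        if [ "$timeout" -gt 0 ] && [ $((SECONDS - start)) -ge "$timeout" ]; then
            return 1          # timed out with jobs still listed
        fi
        sleep "$interval"
    done
    return 0
}
```

For example, wait_for_jobs 15 3600 would poll every 15 seconds and return a non-zero status if jobs are still listed after an hour. Like the loops above, it counts every unfinished job that bjobs lists, not just the ones submitted by this script.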