Question

I am running a greedy feature selection algorithm and am trying to parallelize it with job arrays.

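Each step depends on the previous one, and each iteration has three steps:

Step 1: Set up iteration i
Step 2: Fit the models for iteration i
Step 3: Find the best model of iteration i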

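Because all models (> 10) must have finished training before step 3 can start, a plain old job chain is not the best option. So I am trying to use a job array, which does exactly what I want: step 3 only starts once all models have been fitted.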

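However, I am having trouble setting up the dependencies. I was told that a dependency on an entire job array must use the job ID (i.e. a number) and not the job name (e.g. runSetup$n_subject$i).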

So: how do I get the job ID of the whole job array? Or, better still: what is the best way to set up a dependency on an entire job array?

This answer is interesting, but it does not tell me how best to set up the dependency when my job array contains 10 or more jobs.

#!/bin/bash

# Subject to consider
n_subject=$1 # takes in input arguments from command line.
cohort=$2
priors_and_init=$3
nparam=16

for ((i = 1; i <= $nparam; i++)); do
    # Run setup
    if [[ $i -eq 1 ]]; then
      bsub -J "runSetup$n_subject$i" matlab  -singleCompThread -nodisplay -r "setup_greedy_forward($n_subject,$cohort, $priors_and_init, $i)"
    else
      last_iter=$((i-1))
      bsub -J "runSetup$n_subject$i" -w "done(saveBest$n_subject$last_iter)" matlab  -singleCompThread -nodisplay -r "setup_greedy_forward($n_subject,$cohort, $priors_and_init, $i)"
    fi

    # Fit models
    max_sim=$((nparam-i+1))
    bsub -W 08:00 -J "fitDCMs$n_subject[1-$max_sim]" -w "done(runSetup$n_subject$i)" -R "rusage[mem=16000]" matlab  -singleCompThread -nodisplay -r "fit_dcm_greedy_forward($n_subject,$cohort, $priors_and_init, \$LSB_JOBINDEX)"

    # Extracting the job ID from the fitDCMs jobs
    # Then: For all trained DCMs, get the best model and save it
    JOBID=$(get_jobid bsub -W 08:00 -J "fitDCMs$n_subject[1-$max_sim]" -w "done(runSetup$n_subject$i)" -R "rusage[mem=16000]" matlab  -singleCompThread -nodisplay -r "fit_dcm_greedy_forward($n_subject,$cohort, $priors_and_init, \$LSB_JOBINDEX)" 2> /dev/null)
    if [ -n "$jobid" ]; then
        bsub -J "saveBest$n_subject$i" -w "numdone($JOBID,*)" matlab -singleCompThread -nodisplay -r "save_best_model($n_subject,$cohort, $priors_and_init, $i)"
    fi
done

The output I get:

MATLAB job.
Job <94564566> is submitted to queue <normal.24h>.
MATLAB job.
Job <94564567> is submitted to queue <normal.24h>.
MATLAB job.
saveBest121: No matching job found. Job not submitted.
MATLAB job.
runSetup122: No matching job found. Job not submitted.
[…]

Answer 1

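"I was told that a dependency on an entire job array must use the job ID (i.e. a number) and not the job name"

A job name should work. For example: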

bsub -J "iterate[1-10]" ...
bsub -J "finalize" -w "done(iterate)" ...

The finalize job will not start until all elements of the iterate array have finished.
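As a rough illustration of how this might map onto the loop in the question, here is a dry-run sketch that only prints the bsub commands for one iteration rather than submitting anything. The function name print_iteration, the underscore-separated job names with the iteration number included (so each array has its own name), and the SETUP_CMD / FIT_CMD / SAVE_CMD placeholders are all assumptions made for this example; they are not from the question.

```shell
#!/bin/bash
# Dry-run sketch: prints the bsub commands for one iteration, submits nothing.
# print_iteration, the name scheme and the *_CMD placeholders are hypothetical.
print_iteration() {
    local n_subject=$1 i=$2 nparam=$3
    local max_sim=$((nparam - i + 1))   # number of models to fit in iteration i
    local setup_dep=""

    # From the second iteration on, setup waits for the previous saveBest job.
    if (( i > 1 )); then
        setup_dep="-w \"done(saveBest_${n_subject}_$((i - 1)))\" "
    fi

    printf 'bsub -J "runSetup_%s_%s" %sSETUP_CMD\n' "$n_subject" "$i" "$setup_dep"
    # The array waits for setup of the same iteration.
    printf 'bsub -J "fitDCMs_%s_%s[1-%s]" -w "done(runSetup_%s_%s)" FIT_CMD\n' \
        "$n_subject" "$i" "$max_sim" "$n_subject" "$i"
    # saveBest depends on the array by name, without the [1-N] index part.
    printf 'bsub -J "saveBest_%s_%s" -w "done(fitDCMs_%s_%s)" SAVE_CMD\n' \
        "$n_subject" "$i" "$n_subject" "$i"
}

# Print the commands for the first two iterations of subject 12 with 16 parameters.
for i in 1 2; do
    print_iteration 12 "$i" 16
done
```

Putting the iteration number into the array name is a design choice of this sketch: the question reuses the same array name in every iteration, which could make a name-based dependency ambiguous.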

Answer 2

After some searching, I found a way to get the job ID.

JOBID=$(bsub command1 | awk '/is submitted/{print substr($2, 2, length($2)-2);}')
if [ -n "$JOBID" ]; then
    bsub -w "numdone($JOBID,*)" command2
fi

The first line submits the job and extracts its job ID from the bsub output.

I found the answer here.
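To make the awk step easier to reuse and to try out without a cluster, it could be wrapped in a small function. This is only a sketch: the name extract_jobid is made up for this example, and the sample text fed to it is the bsub output shown in the question.

```shell
#!/bin/bash
# Hypothetical helper: reads bsub's stdout on stdin and prints the numeric job ID.
# Prints nothing if no "is submitted" line is present (e.g. the job was rejected).
extract_jobid() {
    # $2 looks like "<94564566>"; strip the surrounding angle brackets.
    awk '/is submitted/ { print substr($2, 2, length($2) - 2); exit }'
}

# Offline demo using the output format shown in the question.
sample_output='MATLAB job.
Job <94564566> is submitted to queue <normal.24h>.'

JOBID=$(printf '%s\n' "$sample_output" | extract_jobid)
echo "$JOBID"
```

In the question's loop, the array would then be submitted once, with its output piped through the helper (bsub ... | extract_jobid), instead of being submitted a second time. The later check also has to test "$JOBID" rather than "$jobid", since shell variable names are case-sensitive.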
