ducttape sometimes-skip tasks: cross-product error

Asked: 2014-05-16 15:09:22

Tags: bash ducttape

I'm trying out a variant of ducttape's sometimes-skip tasks, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html

([ducttape][1] is a Bash/Scala-based workflow management tool.)

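I'm trying to build a cross-product so that task1 runs on both the "clean" data and the "dirty" data. The idea is to traverse the same path, but in some cases without the preprocessing step. To do that, I need a cross-product of tasks.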

task cleanup < in=(Dirty: a=data/a b=data/b) > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}

task task1 < in=$data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}

Here is the execution plan. I expected 6 tasks, but the plan contains 8: two duplicate tasks are being executed.

$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n] 

With the symlinks removed from the output below, my duplicates are here:

$ head task1/*/out
==> Baseline.baseline/out <==
1

==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean

==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean

==> Dirty.b/out <==
2
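
One way to confirm which realizations are duplicates is to group the output files by content instead of comparing them by eye. The Python sketch below assumes the directory layout shown above (`task1/<realization>/out`); the function name `group_by_content` is hypothetical, and symlinked realization directories are skipped so that each real output is counted once.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def group_by_content(task_dir):
    """Map content digest -> realization names whose 'out' files are identical."""
    groups = defaultdict(list)
    for out in sorted(Path(task_dir).glob("*/out")):
        if out.parent.is_symlink():
            continue  # skip symlinked realization directories
        digest = hashlib.sha256(out.read_bytes()).hexdigest()
        groups[digest].append(out.parent.name)
    # Keep only digests shared by more than one realization.
    return {d: names for d, names in groups.items() if len(names) > 1}

if __name__ == "__main__":
    for names in group_by_content("task1").values():
        print("identical output:", ", ".join(names))
```

Run from the workflow's output directory, this lists each group of realizations whose `out` files have the same bytes.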

Could someone with ducttape experience help me find the problem with my cross-product?

  [1]: https://github.com/jhclark/ducttape

1 answer:

Answer 0 (score: 2)

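So why do we get 4 realizations involving the branch point Clean at task1 instead of just two?

The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. The branch point "Dirty" from the task "cleanup" is therefore propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At that point the variable "clean" contains the cross-product of the original "Dirty" branch point and the newly introduced "Clean" branch point.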
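
The counting behind this can be illustrated with a toy model in Python. This is only a sketch of the argument, not ducttape's actual planning algorithm: the helper `realizations` and the dictionary layout are hypothetical, and each branch point is reduced to a name mapped to its branch names.

```python
from itertools import product

def realizations(branch_points):
    """Cross-product of all branch points visible at a task (toy model)."""
    names = sorted(branch_points)
    return [
        "+".join(f"{n}.{b}" for n, b in zip(names, combo))
        for combo in product(*(branch_points[n] for n in names))
    ]

# cleanup sees only the Dirty branch point.
cleanup = realizations({"Dirty": ["a", "b"]})

# task1 via Data.dirty: the input is (Dirty: a b), so only Dirty is visible.
task1_dirty = realizations({"Data": ["dirty"], "Dirty": ["a", "b"]})

# task1 via Data.clean: the input is (Clean: a b), and every branch points at
# $out@cleanup, which carries Dirty along as a transitive dependency.
task1_clean = realizations(
    {"Data": ["clean"], "Clean": ["a", "b"], "Dirty": ["a", "b"]}
)

print(len(cleanup), len(task1_dirty), len(task1_clean))
for name in task1_clean:
    print(name)
```

The three counts are 2, 2 and 4, which adds up to the 8 tasks in the work plan from the question. The four names printed for the clean path are the Clean x Dirty combinations, even though Clean.a and Clean.b both point at the same cleanup output.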

The minimal change is to replace

clean=(Clean: a=$out@cleanup b=$out@cleanup)

with

clean=$out@cleanup

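This gives you the desired number of realizations, but using a branch point named "Dirty" to control which input dataset you are using is a bit confusing: with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a) and (Dirty: b).

Your workflow will probably be easier to understand if you restructure it as follows: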

global {
    raw_data=(DataSet: a=data/a b=data/b)
}

task cleanup < in=$raw_data > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}

task task1 < in=$ready_data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (DataSet: *) * (DoCleanup: *)
}
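
As a rough check of the restructured workflow, the plan `(DataSet: *) * (DoCleanup: *)` can be enumerated by hand. The Python sketch below is a toy model, not ducttape's planner: the function `resolve_input` and the label strings are illustrative assumptions about what each realization of task1 would read, and the real realization directory names will differ (the log in the question shows ducttape naming one realization Baseline.baseline, for example).

```python
from itertools import product

DATASET = {"a": "data/a", "b": "data/b"}  # raw_data=(DataSet: a=... b=...)
DO_CLEANUP = ["no", "yes"]                # ready_data=(DoCleanup: no=... yes=...)

def resolve_input(dataset, do_cleanup):
    """Return a label for what task1 reads in one realization (toy model)."""
    if do_cleanup == "yes":
        # yes=$out@cleanup: output of the cleanup realization for this dataset
        return f"out@cleanup[DataSet.{dataset}]"
    # no=$raw_data: the raw file itself, so cleanup is skipped
    return DATASET[dataset]

# One task1 realization per (DataSet, DoCleanup) pair.
task1 = {
    f"DataSet.{d}+DoCleanup.{c}": resolve_input(d, c)
    for d, c in product(DATASET, DO_CLEANUP)
}
# cleanup only depends on DataSet.
cleanup = [f"DataSet.{d}" for d in DATASET]

for name, source in task1.items():
    print(name, "<-", source)
print(len(cleanup) + len(task1), "tasks in total")
```

In this model task1 has four realizations and cleanup has two, which is the total of 6 tasks the question expected.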