I'm trying a variant of ducttape's "sometimes skip a task" pattern, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html
([ducttape][1] is a Bash/Scala-based workflow management tool.)
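I'm trying to do a cross-product so that task1 runs on both the "clean" data and the "dirty" data. The idea is to traverse the same path, but in some cases without preprocessing. To do this, I need a cross-product of tasks.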
task cleanup < in=(Dirty: a=data/a b=data/b) > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}
task task1 < in=$data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}
Here is the execution plan. I expected 6 tasks, but two duplicate tasks are also being executed.
$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n]
In the output below (symlinks excluded), you can see my duplicates:
$ head task1/*/out
==> Baseline.baseline/out <==
1
==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean
==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean
==> Dirty.b/out <==
2
Can anyone with ducttape experience help me find my cross-product problem?
[1]: https://github.com/jhclark/ducttape
Answer 0 (score: 2)
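So why do we get 4 realizations involving the branch point Clean at task1 rather than just two?
The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At that point the variable "clean" contains the cross-product of the original "Dirty" branch point and the newly introduced "Clean" branch point.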
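To make the counting concrete, here is a small Python sketch of the propagation rule just described. It is only a toy model written for this answer, not ducttape's actual algorithm, and the names `task1_realizations` and `task1_input` are made up for illustration. It assumes that the Data.dirty branch carries only "Dirty", while the Data.clean branch carries "Clean" crossed with the "Dirty" inherited from cleanup.

```python
from itertools import product

# Toy model only (not ducttape's implementation): branch points and their branches.
DIRTY = ("a", "b")  # introduced at the task "cleanup"
CLEAN = ("a", "b")  # introduced inside the global variable "clean"

def task1_realizations():
    """Enumerate task1 realizations for the original workflow."""
    realizations = []
    # Data.dirty reads the raw files, so only Dirty is involved.
    for dirty in DIRTY:
        realizations.append({"Data": "dirty", "Dirty": dirty})
    # Data.clean reads $out@cleanup, so Dirty is inherited from cleanup
    # and crossed with the newly introduced Clean.
    for clean, dirty in product(CLEAN, DIRTY):
        realizations.append({"Data": "clean", "Clean": clean, "Dirty": dirty})
    return realizations

def task1_input(r):
    """Which file a realization actually reads; Clean never affects it."""
    if r["Data"] == "dirty":
        return f"data/{r['Dirty']}"
    return f"cleanup[Dirty.{r['Dirty']}]/out"

reals = task1_realizations()
for r in reals:
    label = "+".join(f"{k}.{v}" for k, v in sorted(r.items()))
    print(label, "<-", task1_input(r))
print(len(reals), "realizations,", len({task1_input(r) for r in reals}), "distinct inputs")
```

Under these assumptions the model yields 6 realizations of task1 but only 4 distinct inputs, which is consistent with the work plan in the question: Clean.a and Clean.b both point at the same cleanup output, so two of the runs are duplicates.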
The minimal change would be to replace
clean=(Clean: a=$out@cleanup b=$out@cleanup)
with
clean=$out@cleanup
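This gives you the desired number of realizations, but using the branch point name "Dirty" to control which input data set you are using is a bit confusing: with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a b).
It may make your workflow easier to understand if you restructure it like this: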
global {
raw_data=(DataSet: a=data/a b=data/b)
}
task cleanup < in=$raw_data > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}
task task1 < in=$ready_data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (DataSet: *) * (DoCleanup: *)
}
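As a rough sanity check on the restructured workflow, the same kind of toy enumeration can be applied. Again, this Python is only an illustrative model under the assumption that cleanup depends on DataSet alone and task1 depends on DataSet crossed with DoCleanup; it does not call ducttape, and the variable names are hypothetical.

```python
from itertools import product

# Toy model only: branches of the two branch points in the restructured workflow.
DATASET = ("a", "b")        # (DataSet: a=data/a b=data/b)
DO_CLEANUP = ("no", "yes")  # (DoCleanup: no=$raw_data yes=$out@cleanup)

# cleanup reads $raw_data, so it varies only with DataSet.
cleanup_reals = [{"DataSet": d} for d in DATASET]

# task1 reads $ready_data; DataSet is inherited on both DoCleanup branches.
task1_reals = [
    {"DataSet": d, "DoCleanup": c} for d, c in product(DATASET, DO_CLEANUP)
]

total = len(cleanup_reals) + len(task1_reals)
print(len(cleanup_reals), "cleanup +", len(task1_reals), "task1 =", total, "tasks")
```

In this model each branch point is introduced exactly once, so the count comes out to 2 cleanup tasks plus 4 task1 tasks, the 6 tasks the question expected.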