Understanding and overcoming an AmbiguousRuleException in snakemake

Time: 2017-07-26 16:58:22

Tags: snakemake

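I have a complex workflow that I have been extending gradually. The latest extension resulted in an AmbiguousRuleException. I have tried to reproduce the key structure of the workflow in the following example: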

NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]


rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """
# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output

rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """

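The state shown above works. Switching the lines at (1) and (4) and uncommenting (2) and (3) corresponds to the extension I am trying to make, and results in the following failure: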

AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
    gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
    join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt

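It seems snakemake thinks that results/allthings/all_yes.txt could be generated by gather_things.

Why?

How can I avoid this?

Note: the goal of modifications (3) and (4) is to have compute_md5 handle both the direct outputs of gather_things (foo, bar and baz) and the joined output of the three (all), while keeping inputs defined as much as possible in terms of the outputs of other rules (this makes changes easier than when file names are used explicitly).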

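1 answer:

Answer 0 (score: 1)

Post edited for brevity on 2017-07-28

Initially I thought this was simply a matter of ambiguity. The first three points deal with resolving the ambiguity. After that, I explain how to generalize "compute_md5" to get the desired behavior.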
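For context on why the exception appears at all: Snakemake matches a requested file against each rule's output pattern, and by default a wildcard matches any non-empty string. The sketch below imitates that matching with plain Python regular expressions. The helper `pattern_to_regex` is hypothetical and only approximates Snakemake's behaviour, but it shows that results/allthings/all_yes.txt fits the output of gather_things (with word "all") just as well as the output of join_all_words.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Turn a Snakemake-style output pattern into a compiled regex.

    Hypothetical helper, not Snakemake's own code: every {name} becomes a
    named group matching '.+' unless a constraint for that name is given.
    """
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal path fragment
        else:
            regex += "(?P<%s>%s)" % (part, constraints.get(part, ".+"))
    return re.compile(regex)


GATHER_OUTPUT = "results/allthings/{word}_{choice}.txt"
JOIN_OUTPUT = "results/allthings/all_{choice}.txt"
target = "results/allthings/all_yes.txt"

gather_match = pattern_to_regex(GATHER_OUTPUT).fullmatch(target)
join_match = pattern_to_regex(JOIN_OUTPUT).fullmatch(target)

# Both patterns match the same file, hence two candidate rules.
print("gather_things wildcards:", gather_match.groupdict())
print("join_all_words wildcards:", join_match.groupdict())
```

Since both patterns accept the target, neither rule is an obvious winner, which is the situation the exception reports.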

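Controlling ambiguity

1) Control the ambiguity:

ruleorder: http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules

I would recommend avoiding this in the following circumstance. In the grand hope of modularization, using "ruleorder" effectively bonds two rules together. "ruleorder" can only be used when both rules are present in the scope of the Snakefile, which can be a problem for modularization if the rules are not always provided together. If the rules are always provided together, I would argue they are already coupled, so using it does not make the situation any worse and in fact increases cohesion. Use "ruleorder" when constraints are insufficient, since some ambiguity is unavoidable.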

https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)

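Conditional 'include': https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen

The ruleorder is in "_INCLUDE". The outputs of sam2BAM and bamALign_bwa are very similar, mainly because sam2BAM is so generic.

Because bamALign_bwa and bamALIGN_star are technically interchangeable, and I do not want users to swap the rule order just to switch between them, I keep a boolean in my YAML file that acts as a hard filter and literally prevents Snakemake from even seeing the rule. This works very well when only one of the two can be chosen. (In this case the two aligners each have their own reference genome. I force the user to set the reference genome at the beginning of my pipeline, so the user cannot actually run both at once. I have not implemented detection of which reference genome is in use in order to select the corresponding aligner. That would take some extra up-front Python code; a great idea, but not implemented at present.)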
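The boolean "hard filter" described above could look roughly like the following sketch. Everything named here is an assumption for illustration: `use_bwa` stands in for the boolean stored in the YAML file, and the module paths are placeholders, not the actual layout of the linked repository.

```python
def modules_to_include(config):
    """Return the module files a Snakefile should include.

    Hypothetical names throughout: 'use_bwa' is an assumed config key and
    the .smk paths are placeholders.
    """
    modules = ["modules/sam2bam.smk"]  # generic rules, always included
    if config.get("use_bwa", True):
        modules.append("modules/align_bwa.smk")
    else:
        modules.append("modules/align_star.smk")
    return modules


# Only one aligner module is ever selected, so the rules of the other
# aligner are never defined and cannot cause ambiguity.
print(modules_to_include({"use_bwa": False}))
```

In a Snakefile, each returned path would presumably be handed to an `include:` statement, so the choice between aligners is made once in the configuration rather than by editing a rule order.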

2) Ask Snakemake to ignore the ambiguity.

Override. It exists, but I think "--allow-ambiguity" should be avoided whenever possible.

http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules

3) Prevent the ambiguity elegantly.

http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='[^(all)][0-9a-zA-Z]*'
...

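This rule needs a wildcard_constraint so that it does not compete with the "join_all_words" rule. That is easily done by preventing the wildcard "word" here from being the string 'all', which makes "gather_things" and "join_all_words" distinguishable.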
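One caveat about that constraint: `[^(all)]` is a character class, so it rejects any word whose first character is `(`, `a`, `l` or `)`, rather than only the string 'all'. That happens to be enough for foo, bar and baz. The sketch below checks this with Python's `re` module and compares it with a negative lookahead. It assumes the constraint is inserted into the full output regex, which is how I understand Snakemake to work; the lookahead variant is an alternative suggested here and has not been tried inside Snakemake.

```python
import re

CHAR_CLASS = r"[^(all)][0-9a-zA-Z]*"  # the constraint as written above
LOOKAHEAD = r"(?!all_)[0-9a-zA-Z]+"   # assumed alternative, see lead-in


def gather_accepts(word, constraint):
    """Would the gather_things output pattern accept this word?

    The regex approximates "results/allthings/{word}_{choice}.txt" with the
    given constraint on 'word' and the default '.+' on 'choice'.
    """
    regex = r"results/allthings/(?P<word>%s)_(?P<choice>.+)\.txt" % constraint
    target = "results/allthings/%s_yes.txt" % word
    return re.fullmatch(regex, target) is not None


for word in ["foo", "bar", "baz", "all", "apple", "lemon"]:
    print(word, gather_accepts(word, CHAR_CLASS), gather_accepts(word, LOOKAHEAD))
```

With the character class, hypothetical words such as "apple" or "lemon" would be rejected along with "all", so the lookahead form may be the safer choice if the word list grows.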

compute_md5 generalizability

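As for getting "compute_md5" to accept input from both "gather_things" and "join_all_words", that requires making it more generic and has nothing to do with the ambiguity. The next thing you need to do is adjust the "compute_md5" rule so that it does not depend on the output of any particular rule.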
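To illustrate what "more generic" buys you, the following sketch approximates a purely path-based rule with the patterns "{pathCMD5}/{sample}.md5sum" and "{pathCMD5}/{sample}.txt" using a plain Python regex. The function `md5_input_for` is hypothetical; it only mimics how wildcard values captured from the requested output are substituted into the input.

```python
import re

# Approximation of the output pattern "{pathCMD5}/{sample}.md5sum", with
# both wildcards matching '.+' (Snakemake's default).
MD5_OUTPUT = re.compile(r"(?P<pathCMD5>.+)/(?P<sample>.+)\.md5sum")


def md5_input_for(target):
    """Derive the .txt input a generic md5 rule would request for target."""
    match = MD5_OUTPUT.fullmatch(target)
    if match is None:
        raise ValueError("not an .md5sum target: %r" % target)
    # Substitute the captured wildcard values into the input pattern.
    return "{pathCMD5}/{sample}.txt".format(**match.groupdict())


# The same rule serves per-word files and the joined "all" files.
print(md5_input_for("results/allthings/foo_no.md5sum"))
print(md5_input_for("results/allthings/all_yes.md5sum"))
```

Because the rule no longer names gather_things or join_all_words, whichever rule can produce the requested .txt file is picked up automatically.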

https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg

I would also like to thank you for providing a TOP-NOTCH example. Brilliant!

NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]


rule all:
    input:
        expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='[^(all)][0-9a-zA-Z]*'
    shell:
        """
        cat {input} > {output}
        """

rule join_all_words:
    input:
        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
    output:
        "results/allthings/all_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

rule compute_md5:
    input:
        "{pathCMD5}/{sample}.txt"
    output:
        "{pathCMD5}/{sample}.md5sum"
        #"results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}