A curious case of Snakemake

Date: 2019-03-01 14:05:19

Tags: snakemake

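My goal is similar to the one in "Snakemake: unknown output/input files after splitting by chromosome". However, as pointed out there, I do know in advance that my sample.bam file contains, say, 5 chromosomes. As a toy example: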

$ cat sample.bam 
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5

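I would like to "split" this bam file and then run a bunch of per-chromosome downstream jobs on the resulting chromosomes. The simplest solution I could think of is: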

chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)


rule chromosome :
    output :
        touch('sample.REF_{chromosome}.bam')

    input : 'split.done'


rule split_bam :
    output :
        touch('split.done')

    input : 'sample.bam'

    run :
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]) :
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1

This results in empty sample.REF_{chromosome}.bam files. I understand why this happens, and Snakemake even warns about it, e.g.:

Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.

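That is, these files were not present when the DAG was created, and snakemake touches them with empty versions, thereby erasing their contents. I guess I am surprised by this behavior and wonder whether there is a good reason for it. Note that this behavior is not limited to snakemake's touch(): if I replace touch('sample.REF_{chromosome}.bam') with a plain 'sample.REF_{chromosome}.bam' and add shell : 'touch {output}', I get the same result. Now, of course, I have found a perfectly acceptable workaround: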

chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)


rule chromosome :
    output : 'sample.REF_{chromosome}.bam'

    input : 'split_dir'

    shell : 'mv {input}/{output} {output}'


rule split_bam :
    output :
        temp(directory('split_dir'))

    input : 'sample.bam'

    run :
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]) :
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1

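But I am surprised that I have to go through these gymnastics for a seemingly simple task. So I wonder whether there is a better design, or whether I am not asking the right question. Any advice or ideas are most welcome.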
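The emptying effect described above can be reproduced outside Snakemake with a small sketch. It rests on an assumption: that a job's pre-existing output is removed before the job runs, which is what the warning suggests. The function `run_touch_job` is a hypothetical stand-in for rule chromosome, not part of Snakemake.

```python
import tempfile
from pathlib import Path


def run_touch_job(output: Path) -> None:
    """Mimic a job whose only action is to touch its output.

    Assumption: an output that already exists is removed before
    the job body runs, so only an empty file is left afterwards.
    """
    if output.exists():
        output.unlink()  # the stale output is discarded first
    output.touch()       # then an empty file is created


with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "sample.REF_3.bam"
    bam.write_text("chromosome 3\n")  # what split_bam wrote as a side effect
    size_before = bam.stat().st_size
    run_touch_job(bam)                # what rule chromosome does next
    size_after = bam.stat().st_size

print(size_before, size_after)
```

Under that assumption the second size is zero: the content written as an undeclared side effect of split_bam does not survive the job that officially owns the file.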

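1 answer:

Answer 0: (score: 0)

I think your example is a bit contrived, for two reasons. First, rule split_bam already produces the final output sample.REF_{chromosome}.bam. Second, rule master takes its chromosomes from the variable chromosomes, whereas rule split_bam iterates through the bam file to get them.

My impression is that what you want is something like this: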

chromosomes= '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)

rule split_bam :
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome= chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end = '', file = open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')
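To keep the chromosome list and the file contents from drifting apart, the body of the run block could be factored into a plain Python function. What follows is only a sketch under the toy assumption that each line of sample.bam stands for one chromosome. The name `split_by_chromosome` and its parameters are hypothetical and not part of Snakemake.

```python
import tempfile
from pathlib import Path
from typing import List


def split_by_chromosome(source: Path, chromosomes: List[str], out_dir: Path) -> List[Path]:
    """Write line i of `source` to sample.split.<chromosomes[i]>.bam.

    Raises ValueError when the declared chromosomes and the lines
    found in the file do not match in number.
    """
    lines = source.read_text().splitlines(keepends=True)
    if len(lines) != len(chromosomes):
        raise ValueError(
            "expected {} chromosomes, found {} lines".format(len(chromosomes), len(lines))
        )
    written = []
    for chromosome, line in zip(chromosomes, lines):
        outfile = out_dir / "sample.split.{}.bam".format(chromosome)
        # The context manager closes each handle as soon as it is written.
        with open(outfile, "w") as handle:
            handle.write(line)
        written.append(outfile)
    return written


# Usage on the toy input from the question.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    bam = tmp_dir / "sample.bam"
    bam.write_text("".join("chromosome {}\n".format(i) for i in range(1, 6)))
    outputs = split_by_chromosome(bam, "1 2 3 4 5".split(), tmp_dir)
    names = [path.name for path in outputs]
    third = outputs[2].read_text()

print(names)
```

Because every file the function writes is one that split_bam declares through expand(), no rule ends up owning a file that another rule created behind its back.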