My goal is similar to the one in "Snakemake: unknown output/input files after splitting by chromosome", except that, as stated, I do know in advance that my sample.bam file contains, say, 5 chromosomes. As a toy example:
$ cat sample.bam
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5
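I want to "split" this bam file and then run a series of per-chromosome downstream jobs on the resulting chromosomes. The simplest solution I could think of was: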
chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome :
    output :
        touch('sample.REF_{chromosome}.bam')
    input : 'split.done'

rule split_bam :
    output :
        touch('split.done')
    input : 'sample.bam'
    run :
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]) :
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
This results in empty sample.REF_{chromosome}.bam files. I understand why this happens, and snakemake even warns about it, e.g.:
Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.
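That is, these files were not in the DAG initially, and snakemake touches them with empty versions, thereby erasing what was in them. I am surprised by this behavior and wonder whether there is a good reason for it. Note that this behavior is not limited to snakemake's touch(): if I replace touch('sample.REF_{chromosome}.bam') with a plain 'sample.REF_{chromosome}.bam' and then use shell : 'touch {output}', I get the same result. Now, of course, I have found a perfectly acceptable workaround: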
chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome :
    output : 'sample.REF_{chromosome}.bam'
    input : 'split_dir'
    shell : 'mv {input}/{output} {output}'

rule split_bam :
    output :
        temp(directory('split_dir'))
    input : 'sample.bam'
    run :
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]) :
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
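But I am surprised that I have to go through these gymnastics for a seemingly simple task. So I wonder whether there is a better design, or whether I am not asking the right question. Any suggestions or ideas are very welcome.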
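For readers who want to see the emptying effect in isolation, here is a minimal plain-Python sketch. It assumes that Snakemake removes a job's pre-existing declared output before running the job, which is what the warning above suggests; `run_job_like_snakemake` is a hypothetical stand-in for illustration, not a Snakemake function.

```python
import tempfile
from pathlib import Path


def run_job_like_snakemake(output: Path) -> None:
    """Hypothetical stand-in for how a job treats its declared output."""
    if output.exists():
        output.unlink()  # assumption: stale output is removed before the job runs
    output.touch()       # what touch(...) or `touch {output}` then does


with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "sample.REF_3.bam"
    bam.write_text("chromosome 3\n")  # side effect of split_bam in the first Snakefile
    size_before = bam.stat().st_size
    run_job_like_snakemake(bam)       # rule chromosome "creating" the same file
    size_after = bam.stat().st_size

print(size_before, size_after)
```

Under that assumption, the content written by split_bam is lost because the file is also the declared output of a later rule, which is why the directory-based workaround (where only one rule claims each final file) does not show the problem.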
Answer 0 (score: 0)
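I think your example is somewhat contrived, for two reasons. First, the rule split_bam already produces the final output sample.REF_{chromosome}.bam. Second, the rule master takes the chromosomes from the variable chromosomes, whereas the rule split_bam iterates over the bam file to get the chromosomes.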
My impression is that what you want could be something like this:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule split_bam:
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome = chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end = '', file = open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')
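The body of the run block in split_bam could be factored into a small helper along these lines. This is only a sketch for the toy text file from the question, not for real BAM data (which would need a tool such as samtools); the function name `split_by_chromosome` and the `pattern` parameter are made up for illustration. Unlike the inline version, it writes the actual line for each chromosome and closes every file handle.

```python
import tempfile
from pathlib import Path


def split_by_chromosome(source, chromosomes, pattern="sample.split.{}.bam"):
    """Write line i of `source` to the file named after chromosomes[i]."""
    written = []
    with open(source) as lines:
        # zip stops at the shorter of the two, so extra lines are ignored
        for chromosome, line in zip(chromosomes, lines):
            outfile = Path(source).parent / pattern.format(chromosome)
            with open(outfile, "w") as handle:  # closed on leaving the block
                handle.write(line)
            written.append(outfile)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the toy sample.bam from the question.
    toy = Path(tmp) / "sample.bam"
    toy.write_text("".join(f"chromosome {i}\n" for i in range(1, 6)))
    outputs = split_by_chromosome(toy, "1 2 3 4 5".split())
    names = [path.name for path in outputs]
    contents = [path.read_text() for path in outputs]

print(names)
```

Because the file names come from the same chromosomes list that the output directive expands, the files written always match the ones the rule declares.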