这是一个比报告here更复杂的案例。 我的输入文件如下:
ont_Hv1_2.5+.fastq
ont_Hv2_2.5+.fastq
pacBio_Hv1_1-1.5.fastq
pacBio_Hv1_1.5-2.5.fastq
pacBio_Hv1_2.5+.fastq
pacBio_Hv2_1-1.5.fastq
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq
我想只处理现有的输入文件,即会自动跳过与不存在的输入文件对应的通配符组合。
我的Snakefile看起来像这样:
import glob
import os.path
from itertools import product
#make wildcards regexps non-greedy:
wildcard_constraints:
capDesign = "[^_/]+",
sizeFrac = "[^_/]+",
techname = "[^_/]+",
# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)
# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
AUTHORIZEDCOMBINATIONS.append(tup)
# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
def filtered_combinator(*args, **kwargs):
for wc_comb in product(*args, **kwargs):
for ac in AUTHORIZEDCOMBINATIONS:
if(wc_comb[0:3] == ac):
print ("SUCCESS")
yield(wc_comb)
break
return filtered_combinator
filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)
rule all:
input:
expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)
#get read lengths for all FASTQ files:
rule getReadLength:
input: "{techname}_{capDesign}_{sizeFrac}.fastq"
output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field
#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
output: "{techname}_{capDesign}_all.readlength.tsv"
shell: "cat {input} > {output}"
规则getReadLength
收集每个输入FASTQ的读取长度(即每个techname, capDesign, sizeFrac
组合)。
规则aggReadLength
合并getReadLength
为每个techname, capDesign
组合生成的读取长度统计信息。
工作流程失败,并显示以下消息:
Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq
因此,应用于目标的通配符组合过滤步骤似乎不会传播到它所依赖的所有上游规则。谁知道如何做到这一点?
(使用Snakemake版本4.4.0。)
提前多多感谢
答案 0 :(得分:1)
我想我解决了这个问题,希望这对其他人有用。
在aggReadlength
规则中,我更换了
input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
与
input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)