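Edit 2: I figured it out. I posted the solution as an answer.
Edit 1: Following @bli's suggestion and https://stackoverflow.com/a/41185568/1025741, I added the beginning of a solution at the end of the question.
I am writing a Snakemake file in which I parse a sample sheet (its path is defined in the YAML config file) in order to concatenate the files listed in that sheet.
The sample sheet looks like this: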
sample unit fq1 fq2
A lane1 A.l1.1.R1.txt A.l1.1.R2.txt
A lane1 A.l1.2.R1.txt A.l1.2.R2.txt
A lane2 A.l2.R1.txt A.l2.R2.txt
B lane1 B.l1.R1.txt B.l1.R2.txt
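The idea is to concatenate the files (listed in fq1 and fq2) that belong to the same sample and the same unit. In this case:
A.l1.1.R1.txt and A.l1.2.R1.txt will be concatenated;
A.l1.1.R2.txt and A.l1.2.R2.txt will be concatenated.
The other files will not be concatenated, but they will still be written out in this directory structure: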
{sample}/
    {sample}_{unit}_merge_R1.txt
    {sample}_{unit}_merge_R2.txt
So for this example I should end up with:
A/
    A_lane1_merge_R1.txt
    A_lane1_merge_R2.txt
    A_lane2_merge_R1.txt
    A_lane2_merge_R2.txt
B/
    B_lane1_merge_R1.txt
    B_lane1_merge_R2.txt
Here is my Snakefile for this task:
import pandas as pd

shell.executable("bash")
configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
units = units.set_index(["sample", "unit"])

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
               sample=units.index.get_level_values('sample').unique(),
               unit=units.index.get_level_values('unit').unique()),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
               sample=units.index.get_level_values('sample').unique(),
               unit=units.index.get_level_values('unit').unique())

def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    shell:
        """
        echo {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        echo {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """
And config.yaml:
units: units.tsv
But since sample B has no unit lane2, I get this error:
InputFunctionException in line 29 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
KeyError: ('B', 'lane2')
Wildcards:
sample=B
unit=lane2
Is there a way or trick to avoid this kind of error? Thanks.
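To make the cause of the KeyError concrete: with two keyword arguments, expand() combines every sample with every unit, so it also requests pairs that never occur in the sheet. The following plain-Python sketch reproduces that with itertools.product; the names present, requested and missing are illustrative only and are not part of the Snakefile.

```python
from itertools import product

# (sample, unit) pairs as they appear, row by row, in the example sample sheet
present = [("A", "lane1"), ("A", "lane1"), ("A", "lane2"), ("B", "lane1")]

samples = sorted({s for s, _ in present})   # ['A', 'B']
units = sorted({u for _, u in present})     # ['lane1', 'lane2']

# A full cartesian product, which is what combining both wildcards amounts to
requested = set(product(samples, units))

# Pairs that are requested but have no row in the sheet
missing = sorted(requested - set(present))
print(missing)  # [('B', 'lane2')]
```

The one missing pair is exactly the combination reported in the error message, which is why restricting the combinations to those present in the sheet avoids the failure.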
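Beginning of a solution
Following @bli's suggestion, I used a filtered version of itertools.product, wrapped in a higher-order generator that checks whether each wildcard combination it produces is in a pre-established list: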
import pandas as pd

shell.executable("bash")
configfile: "config.yaml"

###
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of sample-unit pairs included in the samplesheet
inList = {
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
               filtered_product,
               sample=units.index.get_level_values('sample').unique().values,
               unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
               filtered_product,
               sample=units.index.get_level_values('sample').unique().values,
               unit=units.index.get_level_values('unit').unique().values)

def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        cat {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """
But when running snakemake -n, I get:
Job 1: test
RuleException in line 53 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Any clue?
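A note on the NameError above: in a shell block, bare names such as {sample} are not defined; wildcard values are reached as {wildcards.sample}, and output paths as {output} or a named output. The sketch below is only an analogy using Python's built-in str.format, not Snakemake's own formatter, and the namespace dictionary and file names are made up for illustration.

```python
from types import SimpleNamespace

# Stand-ins for the objects a rule exposes to its shell command (illustrative only)
wildcards = SimpleNamespace(sample="A", unit="lane1")
output = SimpleNamespace(r1_o="A/A_lane1_merge_R1.txt")
namespace = {"wildcards": wildcards, "output": output}

# A bare {sample} is not a known name, so formatting fails
bad = "cat reads.txt > {sample}/{sample}_{unit}_merge_R1.txt"
try:
    bad.format(**namespace)
    bad_error = None
except KeyError as err:
    bad_error = err.args[0]  # name of the first unknown field

# Going through wildcards or output resolves to the same path
via_wildcards = (
    "cat reads.txt > "
    "{wildcards.sample}/{wildcards.sample}_{wildcards.unit}_merge_R1.txt"
).format(**namespace)
via_output = "cat reads.txt > {output.r1_o}".format(**namespace)

print(bad_error)
print(via_wildcards)
print(via_output)
```

Referring to the output by name has the extra benefit that the path is written only once, in the output section of the rule.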
Answer 0 (score: 1)
Here is the solution I found, based on https://stackoverflow.com/a/41185568/1025741:
import pandas as pd

shell.executable("bash")
configfile: "config.yaml"

###
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of sample-unit pairs
#inList=units[["sample","unit"]].drop_duplicates().to_dict('r')
inList = {
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
               filtered_product,
               sample=units.index.get_level_values('sample').unique().values,
               unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
               filtered_product,
               sample=units.index.get_level_values('sample').unique().values,
               unit=units.index.get_level_values('unit').unique().values)

def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        r1_o = "{sample}/{sample}_{unit}_merge_R1.txt",
        r2_o = "{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        # refer to the named outputs instead of bare wildcard names
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        """
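The hard-coded inList could probably be derived from the sample sheet itself. The sketch below shows one possible way using only the standard library, so it runs outside Snakemake. It assumes a tab-separated sheet (the default for pd.read_table) and assumes that the combinator receives one list of (name, value) tuples per wildcard, which is how the filter in this answer is written. SHEET and allowed_pairs are hypothetical names introduced for illustration.

```python
import csv
import io
from itertools import product

# The example sheet from the question, inlined so the sketch is self-contained.
# Assumption: columns are tab-separated.
SHEET = (
    "sample\tunit\tfq1\tfq2\n"
    "A\tlane1\tA.l1.1.R1.txt\tA.l1.1.R2.txt\n"
    "A\tlane1\tA.l1.2.R1.txt\tA.l1.2.R2.txt\n"
    "A\tlane2\tA.l2.R1.txt\tA.l2.R2.txt\n"
    "B\tlane1\tB.l1.R1.txt\tB.l1.R2.txt\n"
)

def allowed_pairs(sheet_text):
    """Collect every (sample, unit) pair in the sheet as a frozenset of items."""
    reader = csv.DictReader(io.StringIO(sheet_text), delimiter="\t")
    return {
        frozenset({("sample", row["sample"]), ("unit", row["unit"])})
        for row in reader
    }

def filter_combinator(combinator, inlist):
    """Same wrapper as in the answer: keep only combinations found in inlist."""
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

in_list = allowed_pairs(SHEET)
filtered_product = filter_combinator(product, in_list)

# One list of (name, value) tuples per wildcard, mimicking the assumed call shape
sample_items = [("sample", s) for s in ("A", "B")]
unit_items = [("unit", u) for u in ("lane1", "lane2")]

kept = [dict(comb) for comb in filtered_product(sample_items, unit_items)]
print(kept)
```

Duplicate rows such as the two A/lane1 lines collapse automatically because the pairs are stored in a set, so the sheet can be edited without touching the Snakefile.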