Snakemake: error with unused wildcards

Asked: 2018-02-12 13:35:46

Tags: python wildcard snakemake

  

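Edit 2: I figured it out. I have posted the solution as an answer.

Edit 1: Following @bli's suggestion and https://stackoverflow.com/a/41185568/1025741, I added the beginning of a solution at the end of the question.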

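I am writing a Snakemake file in which I parse a sample sheet (its path is defined in a YAML config file) in order to concatenate the files listed in that sample sheet.

The sample sheet looks like this: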

sample  unit    fq1 fq2
A   lane1   A.l1.1.R1.txt   A.l1.1.R2.txt
A   lane1   A.l1.2.R1.txt   A.l1.2.R2.txt
A   lane2   A.l2.R1.txt A.l2.R2.txt
B   lane1   B.l1.R1.txt B.l1.R2.txt

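The idea is to concatenate the files (listed in fq1 and fq2) that belong to the same sample and the same unit. In this case:

  • A.l1.1.R1.txt and A.l1.2.R1.txt will be concatenated
  • A.l1.1.R2.txt and A.l1.2.R2.txt will be concatenated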

The other files are not concatenated, but they are still written out in this directory structure:

{sample}/
    {sample}_{unit}_merge_R1.txt
    {sample}_{unit}_merge_R2.txt

So for this example I should end up with:

A/
  A_lane1_merge_R1.txt
  A_lane1_merge_R2.txt
  A_lane2_merge_R1.txt
  A_lane2_merge_R2.txt
B/
  B_lane1_merge_R1.txt
  B_lane1_merge_R2.txt

Here is my Snakefile for this task:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
units = units.set_index(["sample", "unit"])


rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique()),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique())


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()


rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    shell:
        """
        echo {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        echo {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

And config.yaml:

units: units.tsv

But since sample B has no unit lane2, I get this error:

InputFunctionException in line 29 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
KeyError: ('B', 'lane2')
Wildcards:
sample=B
unit=lane2

Is there a way or trick to avoid this kind of error? Thanks.
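To make the cause of the KeyError concrete, here is a minimal sketch in plain Python, without Snakemake or pandas. It assumes that expand(), when given no combinator, combines the value lists the way itertools.product does. The names SHEET, existing_pairs, requested_pairs and missing_pairs are made up for this illustration, and the sample sheet is inlined as a string.

```python
import csv
import io
from itertools import product

# Sample sheet from the question, inlined so the sketch is self-contained.
SHEET = (
    "sample\tunit\tfq1\tfq2\n"
    "A\tlane1\tA.l1.1.R1.txt\tA.l1.1.R2.txt\n"
    "A\tlane1\tA.l1.2.R1.txt\tA.l1.2.R2.txt\n"
    "A\tlane2\tA.l2.R1.txt\tA.l2.R2.txt\n"
    "B\tlane1\tB.l1.R1.txt\tB.l1.R2.txt\n"
)

rows = list(csv.DictReader(io.StringIO(SHEET), delimiter="\t"))

# (sample, unit) pairs that really occur in the sheet.
existing_pairs = {(row["sample"], row["unit"]) for row in rows}

# Unique values per column, which is what rule all hands to expand().
samples = sorted({row["sample"] for row in rows})
units = sorted({row["unit"] for row in rows})

# Every sample combined with every unit (Cartesian product).
requested_pairs = set(product(samples, units))

# Pairs that are requested as targets but have no row in the sheet.
missing_pairs = sorted(requested_pairs - existing_pairs)
print(missing_pairs)
```

The only pair left over is ('B', 'lane2'), which is the key reported in the KeyError: rule all asks for a target whose wildcards have no matching row, so the input function fails on the lookup.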

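  Beginning of a solution

Following @bli's suggestion, I used a filtered version of itertools.product, wrapped in a higher-order generator that checks whether each wildcard combination it yields is in a pre-established list: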

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit included in the samplesheet
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        cat {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

But when I run snakemake -n, it returns an error:
Job 1: test

RuleException in line 53 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

Any clue?
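As a rough analogy for the NameError, the sketch below uses Python's str.format rather than Snakemake's own formatter, so the exception type differs (KeyError here, NameError in Snakemake). The assumption it illustrates is that inside a shell block wildcard values are reached through the wildcards object, as in {wildcards.sample}, while a bare {sample} is only substituted in input and output patterns. The SimpleNamespace object is a hypothetical stand-in for what Snakemake provides.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the wildcards object available to a shell command.
wildcards = SimpleNamespace(sample="A", unit="lane1")

# Wildcard values accessed as attributes of `wildcards`: resolves fine.
ok_template = "{wildcards.sample}/{wildcards.sample}_{wildcards.unit}_merge_R1.txt"
resolved = ok_template.format(wildcards=wildcards)

# Bare names, as in the shell block of the attempt above: nothing called
# `sample` exists in the formatting namespace, so the lookup fails.
bad_template = "{sample}/{sample}_{unit}_merge_R1.txt"
try:
    bad_template.format(wildcards=wildcards)
    bare_name_failed = False
except KeyError:
    bare_name_failed = True

print(resolved, bare_name_failed)
```

Under that assumption, writing {wildcards.sample} and {wildcards.unit} in the shell command, or referring to the rule's outputs via {output}, avoids the unknown name.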

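1 answer:

Answer 0 (score: 1)

Here is the solution I found, based on https://stackoverflow.com/a/41185568/1025741. The change from my attempt above is that the outputs of rule merge are now named and the shell command refers to them as {output.r1_o} and {output.r2_o} instead of the bare {sample} and {unit}: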

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit
#inList=units[["sample","unit"]].drop_duplicates().to_dict('r')
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units=units.set_index(["sample","unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1=get_fastq_r1,
        r2=get_fastq_r2
    output:
        r1_o="{sample}/{sample}_{unit}_merge_R1.txt",
        r2_o="{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        """
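The hard-coded inList could probably be derived from the sample sheet itself. The following is only a sketch in plain Python (csv instead of pandas, and str.format standing in for expand()), assuming the combinator receives one list of (name, value) tuples per wildcard, which is the shape the frozenset entries above imply. SHEET, allowed, sample_items, unit_items and targets are names invented for this illustration.

```python
import csv
import io
from itertools import product

# Sample sheet from the question, inlined so the sketch is self-contained.
SHEET = (
    "sample\tunit\tfq1\tfq2\n"
    "A\tlane1\tA.l1.1.R1.txt\tA.l1.1.R2.txt\n"
    "A\tlane1\tA.l1.2.R1.txt\tA.l1.2.R2.txt\n"
    "A\tlane2\tA.l2.R1.txt\tA.l2.R2.txt\n"
    "B\tlane1\tB.l1.R1.txt\tB.l1.R2.txt\n"
)


def filter_combinator(combinator, allowed):
    """Wrap a combinator so it only yields combinations found in `allowed`."""
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # frozenset makes the check independent of wildcard order.
            if frozenset(wc_comb) in allowed:
                yield wc_comb
    return filtered_combinator


rows = list(csv.DictReader(io.StringIO(SHEET), delimiter="\t"))

# One frozenset per (sample, unit) pair present in the sheet; the set
# comprehension removes the duplicate A/lane1 row automatically.
allowed = {
    frozenset({("sample", row["sample"]), ("unit", row["unit"])})
    for row in rows
}

filtered_product = filter_combinator(product, allowed)

# One list of (name, value) tuples per wildcard (assumed calling convention).
sample_items = [("sample", s) for s in sorted({row["sample"] for row in rows})]
unit_items = [("unit", u) for u in sorted({row["unit"] for row in rows})]

# str.format stands in for expand() to show which targets survive the filter.
pattern = "{sample}/{sample}_{unit}_merge_R1.txt"
targets = [
    pattern.format(**dict(comb))
    for comb in filtered_product(sample_items, unit_items)
]
print(targets)
```

With this, adding a row to the sample sheet changes the allowed combinations without editing the Snakefile, and the B/lane2 target is never requested.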