Question

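I am having some trouble with Snakemake, and so far I have not found the relevant information in the documentation (or anywhere else). I have a large file containing different samples (a multiplexed analysis), and I would like to stop the pipeline for certain samples depending on the results found after a rule.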

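I have already tried changing this value from within a rule definition (using a checkpoint or a def), making the input of the following rules conditional, and treating the wildcards as a simple list from which one item is deleted. Below is an example of what I want to do (the conditional if is only indicative here):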

# Import the config file(s)
configfile: "../PATH/configfile.yaml"

# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]

# Rules

rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)


rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """

rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input}       # Print a message and write in the log file for now

        if [ file_dont_pass_checkpoint ]; then
            # Drop the sample matching the {wildcards.sample} wildcard
            # so the analysis continues only with samples that passed validation
        fi


        somescript2.sh {input} {output}
        """

If anyone has an idea, I am interested. Thanks in advance for your answers.

Answer 1

If I understand correctly, I think this is an interesting situation: if a sample passes some checks, keep analysing it; otherwise, stop early.

At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output file, because rule all asks for it regardless of the result of the check.

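You can have the rule that performs the check write a file containing a flag that says whether the sample passed (say, PASS or FAIL). Based on that flag, the rule doing the analysis then either runs the full analysis (on PASS) or writes an empty file (or whatever you like) on FAIL. Here is the gist:

# `samples` is assumed to be defined earlier, e.g. read from the config
rule all:
    input:
        expand('{sample}.output', sample=samples),

rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        # some_check_is_ok is a placeholder for the real test
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """

rule do_analysis:
    input:
        chk='{sample}.check',
        smp='{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output}  # Do nothing: write an empty file
        fi
        """

If you do not want to see the failed, empty output files at all, you can use the onsuccess directive to delete them at the end of the pipeline.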

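The onsuccess clean-up mentioned in this answer could look roughly like the sketch below. It is only an illustration: the helper name remove_empty_outputs is hypothetical, and it assumes that a zero-byte output always means "sample failed the check", which may not hold if a successful analysis can also produce an empty file.

```python
import os


def remove_empty_outputs(paths):
    """Delete zero-byte files among `paths` and return the removed paths."""
    removed = []
    for path in paths:
        # Only touch files that exist and are empty (the FAIL placeholders).
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(path)
    return removed


# Hypothetical use inside the Snakefile (handler bodies are plain Python):
# onsuccess:
#     remove_empty_outputs(expand("{sample}.output", sample=samples))
```

Keep in mind that once the placeholders are deleted, Snakemake will consider those outputs missing and will try to recreate them on the next run.
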
Answer 2

The canonical solution to this kind of problem is to use checkpoints. Consider the following example:

import pandas as pd

def get_results(wildcards):
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt", 
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )


rule all:
    input:
        get_results


checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perform-qc {input} > {output}"


rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample}.txt"
    shell:
        "process {input} > {output}"

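The idea is as follows: at some point in the pipeline, after some (let's say) preprocessing, you add a checkpoint rule that aggregates over all samples and generates some kind of QC table. Downstream of it, there is a rule that aggregates over samples (for example rule all, or some other aggregation inside the workflow). Suppose that in this aggregation you only want to consider the samples that pass QC. To do so, you determine the required files ("results/processed/{sample}.txt") via an input function that reads the QC table produced by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so you can actually read the table and decide which samples to keep based on the QC criterion it contains. When the DAG is re-evaluated, Snakemake automatically applies any intermediate rules (such as the process rule).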
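
To make the filtering step concrete outside of Snakemake, here is a small standalone sketch of what such an input function computes. It uses the standard csv module instead of pandas so it runs anywhere; the function names, the sample names A/B/C, the QC values, and the 0.5 threshold are all made up for illustration. Only the column names and the output path pattern come from the example in this answer.

```python
import csv
import io


def passing_samples(qc_table, criterion, threshold):
    """Return the sample names whose `criterion` value exceeds `threshold`."""
    reader = csv.DictReader(qc_table, delimiter="\t")
    return [row["sample"] for row in reader if float(row[criterion]) > threshold]


def processed_paths(samples):
    """Build the per-sample target paths requested by the aggregating rule."""
    return [f"results/processed/{s}.txt" for s in samples]


# Made-up QC table standing in for results/qc.tsv.
qc_tsv = io.StringIO(
    "sample\tsome-qc-criterion\n"
    "A\t0.9\n"
    "B\t0.2\n"
    "C\t0.7\n"
)

targets = processed_paths(passing_samples(qc_tsv, "some-qc-criterion", 0.5))
print(targets)  # ['results/processed/A.txt', 'results/processed/C.txt']
```

Sample B falls below the threshold, so its processed file is never requested and the process rule is never run for it.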

Conditional execution of a multiplexed analysis with Snakemake
