Question

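I have a Snakemake workflow that fails because the last job creates either two output files or none at all. I tried to solve this with checkpoints, but I think I am getting stuck on the wildcards when I try to collect the output files in the aggregate function.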

The workflow (1) creates a fasta file from a biom community profile. It then runs an in silico PCR (2) on the fasta file, which produces a txt file as output.

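The last step is a parser (3) that outputs a csv and a fasta file. However, if there are no matches in the txt file (i.e. the in silico PCR produced no results), it creates neither the csv nor the fasta file.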

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta", sample = SAMPLES, id = ID)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

rule parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

The dry run is fine, with no errors. But when I do a real run, Snakemake aborts and reports that the job execution failed:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 31 of snakeflow/Snakefile:
Missing files after 5 seconds:
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.csv
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

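I know I could change the parser script to simply create two empty files, but I would rather not create unnecessary files. I looked into the dynamic feature, but it does not work with two potential output files, so I turned to checkpoints. As far as I understand, they should help me solve the problem.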

Here is my attempt using checkpoints:

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt", sample = SAMPLES, id = ID)
print(TARGETS)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

checkpoint parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.parse.get(**wildcards).output[0,1]
    return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv','output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', sample = wildcards.SAMPLES, id=wildcards.ID)

rule collect:
    input:
        aggregate_input
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt"
    shell:
        "cat {input} >> {output}"

And the error:

<function aggregate_input at 0x7f63eade2158>
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
  File "snakeflow/Snakefile", line 52, in <module>

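I believe this is because there is something wrong with the way I use the wildcards in the aggregate function, but I cannot figure it out. I tried various versions I found in the checkpoint tutorials, but none of them worked.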

Any help is much appreciated, thank you!

Answer 1

I think the immediate error is related to your expand:

return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv',
   'output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', 
   sample=wildcards.SAMPLES, id=wildcards.ID)

"SAMPLES" and "ID" should be lowercase (wildcards.sample and wildcards.id) to match the wildcard names.
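As a rough illustration of what the corrected return value would contain, here is a plain-Python sketch that does not import Snakemake. The `SimpleNamespace` object is only a stand-in for the wildcards object Snakemake passes to an input function, and `candidate_outputs` is a hypothetical helper name. Note that both patterns are kept together in one list rather than passed as two separate positional arguments.

```python
from types import SimpleNamespace

# The two output patterns of the parse step, kept together in one list.
PATTERNS = [
    "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
    "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta",
]

def candidate_outputs(wildcards):
    """Fill both patterns using the lowercase wildcard names."""
    # In a Snakefile the equivalent call would be along the lines of:
    #   expand(PATTERNS, sample=wildcards.sample, id=wildcards.id)
    return [p.format(id=wildcards.id, sample=wildcards.sample) for p in PATTERNS]

# Stand-in for the wildcards object (an assumption for this demo only).
wc = SimpleNamespace(id="2", sample="metaphlan_rectal_SRR5907487")
paths = candidate_outputs(wc)
print(paths)
```

Because the input function is called once per job, `wildcards.sample` and `wildcards.id` each hold a single value, so the result is exactly the two paths for that job.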

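You will still hit the missing output exception, because you are declaring output files that may not exist after the script runs. You would have to change the checkpoint's output to a directory (specific to each sample and id) that will contain either 0 or 2 files. Then, in your input function, you can look at the contents of that directory to see which files exist.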
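The directory approach could look roughly like the sketch below. The helper `existing_outputs` and the file names `hits.csv` and `hits.fasta` are hypothetical; the Snakefile wiring is shown only as comments and assumes the parser script accepts the two output paths as its last arguments, as in the original rule.

```python
import os
import tempfile

def existing_outputs(outdir):
    """Return the sorted paths of the files in outdir (empty list if none)."""
    if not os.path.isdir(outdir):
        return []
    return sorted(
        os.path.join(outdir, name)
        for name in os.listdir(outdir)
        if os.path.isfile(os.path.join(outdir, name))
    )

# Hypothetical wiring in the Snakefile (not executed here):
#
# checkpoint parse:
#     input:
#         "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
#     output:
#         directory("output/metaphlan/isPCR/final/{id}_mismatch_{sample}")
#     shell:
#         "mkdir -p {output} && python scripts/iPCRess_parser_v2.py "
#         "{input} {output}/hits.csv {output}/hits.fasta"
#
# def aggregate_input(wildcards):
#     outdir = checkpoints.parse.get(**wildcards).output[0]
#     return existing_outputs(outdir)

# Small demonstration with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_case = existing_outputs(tmp)  # parser produced nothing
    for name in ("hits.csv", "hits.fasta"):
        open(os.path.join(tmp, name), "w").close()
    full_case = [os.path.basename(p) for p in existing_outputs(tmp)]
```

One thing to watch: when the directory is empty the input list is empty too, and `cat` with no file arguments reads from standard input, so the collect rule would need to handle that case.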

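Personally, I would go the empty-file route. You avoid checkpoints and your rules stay cleaner. With empty files you can distinguish something that has not been run yet from something that produced no results. Note that if you use checkpoints you will end up with empty directories, so you will not fully avoid the empty-file problem anyway.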

If you are worried about inodes or something else on your system, mark the outputs as temp and Snakemake will clean them up after aggregation.
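For the empty-file route, a minimal sketch of what the end of the parser script might do is shown below. `ensure_outputs` is a hypothetical helper, not part of the original script; it only creates empty placeholders for outputs that the parser did not write.

```python
import os
import tempfile

def ensure_outputs(paths):
    """Create an empty file for every path that does not exist yet."""
    created = []
    for path in paths:
        if not os.path.exists(path):
            parent = os.path.dirname(path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            open(path, "w").close()  # empty placeholder
            created.append(path)
    return created

# Hypothetical Snakefile usage, with both outputs marked as temp:
#
# rule parse:
#     input:
#         "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
#     output:
#         temp("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv"),
#         temp("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta")
#     shell:
#         "python scripts/iPCRess_parser_v2.py {input} {output}"

# Demonstration: one output already written, one missing.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "final", "hits.csv")
    fasta_path = os.path.join(tmp, "final", "hits.fasta")
    os.makedirs(os.path.dirname(csv_path))
    with open(csv_path, "w") as handle:
        handle.write("id,seq\n")
    created = [os.path.basename(p) for p in ensure_outputs([csv_path, fasta_path])]
    csv_size = os.path.getsize(csv_path)
    fasta_size = os.path.getsize(fasta_path)
```

Existing files are left untouched, so a real result is never overwritten by a placeholder, and a zero-byte file downstream signals "ran, but no hits".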

snakemake - checkpoints and wildcards
