Snakemake: error when trying to generate multiple output files

Date: 2018-08-21 10:03:31

Tags: output bioinformatics snakemake

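I am writing a snakemake pipeline that fetches publicly available SRA files, converts them to fastq files, and then runs them through alignment, peak calling and LD score regression.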

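I am having a problem with the rule called SRA2fastq below, in which I use parallel-fastq-dump to convert the SRA files into paired-end fastq files. The rule produces two outputs for each SRA file: SRRXXXXXXX_1 and SRRXXXXXXX_2.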

Here is my config file:

samples:
    fullard2018_NpfcATAC_1: SRR5367824
    fullard2018_NpfcATAC_2: SRR5367798
    fullard2018_NpfcATAC_3: SRR5367778
    fullard2018_NpfcATAC_4: SRR5367754
    fullard2018_NpfcATAC_5: SRR5367729
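
To make the mapping explicit: each key is a sample name and each value is its SRA run accession. The sketch below, in plain Python, reproduces the fastq file names the pipeline targets are built from. It uses itertools.product as a stand-in for Snakemake's expand(), and the helper name fastq_targets is made up for illustration; it is not part of the Snakefile.

```python
from itertools import product

# Same mapping as the config file above (sample name -> SRA run accession).
samples = {
    "fullard2018_NpfcATAC_1": "SRR5367824",
    "fullard2018_NpfcATAC_2": "SRR5367798",
    "fullard2018_NpfcATAC_3": "SRR5367778",
    "fullard2018_NpfcATAC_4": "SRR5367754",
    "fullard2018_NpfcATAC_5": "SRR5367729",
}


def fastq_targets(samples, mates=(1, 2), outdir="fastq_files"):
    """Return one gzipped fastq path per (accession, mate) pair.

    Hypothetical helper: builds names for the pattern
    "fastq_files/{SRA}_{num}.fastq.gz" without importing snakemake.
    """
    return [
        f"{outdir}/{sra}_{num}.fastq.gz"
        for sra, num in product(samples.values(), mates)
    ]


if __name__ == "__main__":
    # Five accessions times two mates gives ten target files.
    for path in fastq_targets(samples):
        print(path)
```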

Here are the first few rules of my Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))

rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"

rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
        --outdir fastq_files --split-files --gzip
        """

rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"

rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
       "0.27.1/bio/bowtie2/align"

However, when I run the Snakefile, I get the following error:

Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz

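I have seen this error many times before. It usually occurs when the names of the output files a program generates do not exactly match the output file names specified in the corresponding snakemake rule. That is not the case here, though: if I run the command that snakemake generates for this particular rule on its own, the files are created as expected and the file names match. Here is one instance of the rule, taken from the output of snakemake -np: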

rule SRA2fastq:
    input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
    output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
    log: logs/SRA2fastq/SRR5367779.log
    jobid: 18
    wildcards: SRA=SRR5367779

    parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip

Note that when the parallel-fastq-dump command is run separately (i.e. without snakemake), the output files it generates are named exactly as specified in the SRA2fastq rule:

ls fastq_files
SRR5367729_1.fastq.gz  SRR5367729_2.fastq.gz
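
The same check (do the files on disk carry the names declared in the rule?) can be sketched in a few lines of Python. The function name missing_outputs is hypothetical, and the directory and naming pattern are simply copied from the SRA2fastq rule. It only tests whether the files exist; it says nothing about whether the command itself exited successfully.

```python
from pathlib import Path


def missing_outputs(sra, outdir="fastq_files", mates=(1, 2)):
    """Return the declared SRA2fastq outputs for `sra` that are not on disk."""
    expected = [Path(outdir) / f"{sra}_{num}.fastq.gz" for num in mates]
    return [str(p) for p in expected if not p.is_file()]


if __name__ == "__main__":
    # An empty list means both mate files exist under the declared names.
    print(missing_outputs("SRR5367729"))
```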

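I'm a bit stumped by this, as this error is usually easy to correct, but I can't work out what the problem is. I've tried changing the output section of SRA2fastq to: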

    output:
        file1="fastq_files/{SRA}_1.fastq.gz",
        file2="fastq_files/{SRA}_2.fastq.gz"

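However, this throws the same error. I also tried specifying only one output file, but that breaks the bowtie2 rule later on, which then fails with an input files missing error.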
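
The reason a single declared output breaks bowtie2 is that its input function asks for both mate files of a sample, so a file that no rule declares as output has no producer. Here is a minimal Python sketch of that bookkeeping. The names bowtie2_inputs and unproduced are illustrative, not Snakemake API, and the sketch only compares file names; it does not resolve rules or look at files already on disk the way Snakemake does.

```python
def bowtie2_inputs(sample, samples, mates=(1, 2)):
    """File names the bowtie2 rule's input function requests for one sample."""
    sra = samples[sample]
    return [f"fastq_files/{sra}_{num}.fastq.gz" for num in mates]


def unproduced(sample, samples, declared_mates):
    """Inputs bowtie2 needs that SRA2fastq would not declare as output."""
    declared = set(bowtie2_inputs(sample, samples, declared_mates))
    return [f for f in bowtie2_inputs(sample, samples) if f not in declared]


if __name__ == "__main__":
    samples = {"fullard2018_NpfcATAC_4": "SRR5367754"}
    # Only mate 1 declared: mate 2 has no rule that produces it.
    print(unproduced("fullard2018_NpfcATAC_4", samples, declared_mates=(1,)))
    # Both mates declared: nothing is left without a producer.
    print(unproduced("fullard2018_NpfcATAC_4", samples, declared_mates=(1, 2)))
```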

Any ideas? Is there something I'm missing when trying to produce multiple output files in a single rule?

Many thanks

0 answers:

No answers yet