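I am writing a Snakemake pipeline that fetches publicly available SRA files, converts them to FASTQ files, and then runs them through alignment, peak calling, and LD score regression.
I am having trouble with the rule named SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files into paired-end FASTQ files. For each SRA file, the rule produces two outputs, SRRXXXXXXX_1 and SRRXXXXXXX_2.
Here is my config file: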
samples:
  fullard2018_NpfcATAC_1: SRR5367824
  fullard2018_NpfcATAC_2: SRR5367798
  fullard2018_NpfcATAC_3: SRR5367778
  fullard2018_NpfcATAC_4: SRR5367754
  fullard2018_NpfcATAC_5: SRR5367729
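As a side note, the FASTQ targets that the Snakefile requests are built from the accessions in this mapping, two files per accession. The following is a minimal plain-Python sketch of that expansion, assuming (as is Snakemake's default for expand) that every combination of wildcard values is produced. The helper name expand_targets is hypothetical and only mimics expand; it is not part of Snakemake.

```python
from itertools import product

# Sample name -> SRA accession, copied from the config file above.
samples = {
    "fullard2018_NpfcATAC_1": "SRR5367824",
    "fullard2018_NpfcATAC_2": "SRR5367798",
    "fullard2018_NpfcATAC_3": "SRR5367778",
    "fullard2018_NpfcATAC_4": "SRR5367754",
    "fullard2018_NpfcATAC_5": "SRR5367729",
}


def expand_targets(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill `pattern` with every
    combination of the supplied wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# The same pattern and wildcard values that rule all uses for the FASTQ files.
fastq_targets = expand_targets(
    "fastq_files/{SRA}_{num}.fastq.gz",
    SRA=list(samples.values()),
    num=[1, 2],
)

for target in fastq_targets:
    print(target)
```

With five accessions and two mates each, this lists ten FASTQ paths, all keyed by accession rather than by sample name.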
Here are the first few rules of my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))

rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"

rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
        --outdir fastq_files --split-files --gzip
        """

rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"

rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.27.1/bio/bowtie2/align"
However, when I run the Snakefile, I get the following error:
Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz
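I have seen this error many times before. It usually occurs when the names of the output files produced by a program do not exactly match the output names declared in the corresponding Snakemake rule. That does not seem to be the case here, though: if I run the command that Snakemake generates for this rule on its own, the files are created as expected and the file names match. Here is one instance of the rule, as printed by snakemake -np: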
rule SRA2fastq:
    input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
    output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
    log: logs/SRA2fastq/SRR5367779.log
    jobid: 18
    wildcards: SRA=SRR5367779
parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
Note that when the parallel-fastq-dump command is run on its own (i.e. outside Snakemake), the output files it produces are named exactly as specified in the SRA2fastq rule:
ls fastq_files
SRR5367729_1.fastq.gz SRR5367729_2.fastq.gz
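The same check can be scripted rather than done by eye with ls. This is only a sketch under stated assumptions: the helper names expected_outputs and missing_outputs are hypothetical, the file-name pattern is copied from the output section of SRA2fastq, and the paths are taken relative to the directory the script is run from, which is assumed to be the directory Snakemake is run from.

```python
from pathlib import Path


def expected_outputs(sra_id, outdir="fastq_files"):
    """Paths that the SRA2fastq rule declares for one accession."""
    return [Path(outdir) / f"{sra_id}_{num}.fastq.gz" for num in (1, 2)]


def missing_outputs(sra_id, outdir="fastq_files"):
    """Declared outputs that are not present as files on disk."""
    return [path for path in expected_outputs(sra_id, outdir) if not path.is_file()]


if __name__ == "__main__":
    # Accessions taken from the ls listing and the error message above.
    for sra_id in ("SRR5367729", "SRR5367754"):
        missing = missing_outputs(sra_id)
        if missing:
            print(f"{sra_id}: missing " + ", ".join(str(path) for path in missing))
        else:
            print(f"{sra_id}: all declared outputs present")
```

Running it from a different directory than the pipeline would report files as missing even when they exist, so the working directory matters for this comparison.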
I am a little puzzled, because this error is usually easy to fix, but I cannot work out what is wrong here. I tried changing the output section of SRA2fastq to:
    output:
        file1="fastq_files/{SRA}_1.fastq.gz",
        file2="fastq_files/{SRA}_2.fastq.gz"
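However, this throws the same error. I have also tried specifying only one output file, but that breaks the later bowtie2 rule, which then fails with an "input files missing" error.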
Any ideas? Is there something I am missing when declaring multiple output files in a single rule?
Many thanks.