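Problems with the VEP snakemake wrapper

Date: 2020-08-11 18:18:16

Tags: wrapper snakemake

I am running into two problems when trying to run the VEP wrapper for snakemake.

First, I would like to pass calls through a lambda wildcards function, like this: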

calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
  return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="variants.annotated.vcf",
        stats="variants.html"
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

However, I get an error:

When I replace the lambda wildcards with calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) ([which is not ideal; see this post for the reasons][1]), it gives me an error about the resources folder instead:

(snakemake) [moldach@cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins

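It is also [not clear to me from the documentation][2] how to know which reference genome (version, etc.) should be specified.

Update:

Because of the character limit on comments I cannot reply to both answerers, so I will continue the question here.

As @jafors mentioned, the two wrappers solved the problem with cache and plugins. Thanks!

Now, with the following rule, I get an error when trying to run VEP: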

rule variant_annotation:
    input:
        calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
        stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

This is the error I get from the log:

Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       variant_annotation
        1

[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.

Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
    if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
    jobid: 0
    output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
    conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f

RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail;  python /scratch/moldach/MADDOG/VCF-FILE$
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

To be clear:

This is the code I used to run VEP before trying the wrapper, so I would like to keep similar options (e.g. offline):

vep \
        -i {input.sample} \
        --species "caenorhabditis_elegans" \
        --format "vcf" \
        --everything \
        --cache_version 100 \
        --offline \
        --force_overwrite \
        --fasta {input.ref} \
        --gff {input.annot} \
        --tab \
        --variant_class \
        --regulatory \
        --show_ref_allele \
        --numbers \
        --symbol \
        --protein \
        -o {params.sample}

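Update 2:

Yes, using expand() was the problem. I remember this is why I prefer lambda or os.path.join() for rule input/output, except, as you mentioned, in rule all.

The following seems to get rid of that problem, although I am now facing a new one: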

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")

I am not sure why I get the unknown file type error. As I mentioned, I first tested this with the full command on the same input data.

Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
    "(bcftools view {snakemake.input.calls} | "
  File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]

Update 3:

bcftools view complains about the output of samtools mpileup / varscan pileup2snp:

def getDeduppedBamsIndex(sample):
  return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))

rule mpilup:
    input:
        bam=lambda wildcards: getDeduppedBams(wildcards.sample),
        reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
    params:
        extra=lambda wc: "-r {}".format(wc.contig)
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/samtools/mpileup"

rule mpileup_to_vcf:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
    message:
        "Calling SNP with Varscan2"
    threads:
        2 # Keep threading value to one for unzipped mpileup input
          # Set it to two for zipped mpileup files
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/varscan/mpileup2snp"

rule vcf_merge:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
    log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
    resources:
        mem = 1000,
        time = 10
    threads: 1
    message: """--- Merge VarScan by Chromosome."""
    shell: """
        awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
        """

calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
  return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))

rule annotate_variants:
    input:
        calls=lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="{sample}.annotated.vcf",
        stats="{sample}.html"
    params:
        # Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
        # Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
        plugins=["LoFtool"],
        extra="--everything"  # optional: extra arguments
    log:
        "logs/vep/{sample}.log"
    threads: 4
    resources:
        time=30,
        mem=5000
    wrapper:
        "0.65.0/bio/vep/annotate"

If I run bcftools view on the output, I get this error:

$ bcftools view variant_calling/varscan/MTG324.vcf 
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type

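2 answers:

Answer 0: (score: 0)

  1. Regarding expand vs. wildcards: it does not matter at all. The Biostars post is just advice on how to keep things readable. As far as snakemake and the program logic are concerned, it makes no difference as long as the inputs are correct.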

  2. The complaint about the resources arises because the input of rule variant_annotation declares resources/vep/cache and resources/vep/plugins as inputs that are required to run variant_annotation. With this error, snakemake is effectively telling you that these files do not exist, so it cannot run the rule for you.

  3. Looking at the code in the documentation, it seems that the cache directory given as input is what defines the genome you use.
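The existence check behind the MissingInputException can be imitated by hand. This is a minimal Python sketch, assuming the two paths from the rule in the question; missing_inputs is a hypothetical helper and not part of snakemake. Snakemake itself only reports a path as missing when no other rule can create it.

```python
import os

def missing_inputs(paths):
    """Return the declared input paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the rule in the question; run this from the workflow directory.
declared = ["resources/vep/cache", "resources/vep/plugins"]
print(missing_inputs(declared))
```

An empty list means both directories are already present; anything printed has to be created first, either by hand or by another rule.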

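Answer 1: (score: 0)

In addition to what Maarten said (resources/vep/cache and resources/vep/plugins are just example paths for the required inputs, which also define the genome and version you want to use), you can easily get the cache and plugin directories with two additional simple rules in your Snakefile, using the VEP cache and plugins wrappers.

EDIT

Glad that solved your first problem. The second error seems to be caused by the expand in the output. Do I understand correctly that you want to annotate all VCFs one by one, so that the input is {sample}.vcf and the output is {sample}.annotated.vcf?

In that case, you probably do not want to use expand in this rule.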
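The AttributeError in the traceback comes from the wrapper calling .endswith() on snakemake.output.calls. A minimal Python sketch of why that fails once expand() is used in the output: FakeNamedlist is a hypothetical stand-in for snakemake's Namedlist (assumed here to behave like a list of paths), and the sample names are the ones shown in the log.

```python
class FakeNamedlist(list):
    """Hypothetical stand-in for snakemake's Namedlist: a list of paths."""

samples = ["BC1217", "470", "MTG109"]

# With expand() in the output, the wrapper receives a whole list of paths.
expanded_calls = FakeNamedlist(
    "ANNOTATION/VEP/{}.annotated.vcf".format(s) for s in samples
)

# With a wildcard pattern, each job resolves to a single path string.
single_call = "ANNOTATION/VEP/{sample}.annotated.vcf".format(sample="BC1217")

print(hasattr(expanded_calls, "endswith"))  # False: lists have no endswith()
print(single_call.endswith(".vcf.gz"))      # False, but the call itself works
```

The wrapper's check only works on the second form, which is what a rule with a {sample} wildcard in its output produces.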

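I am also not sure why you need {ANNOT_DIR} and {ANNOT_TOOL} as wildcards here. I would guess that if you are using VEP, ANNOT_TOOL will always be VEP and ANNOT_DIR will always be ANNOTATION? Then you could write them directly as ANNOTATION/VEP/{sample}.annotated.vcf.

The same goes for {CALLING_DIR}; I guess that will always be the same directory, right? I understand that {CALLING_TOOL} could take several values if you use more than one caller on your samples.

If I am still on the right track, that leaves two wildcards, {sample} and {CALLING_TOOL}, that may need to be expanded when using VEP.

Just write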

input:
    calls='CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
    cache="resources/vep/cache",
    plugins="resources/vep/plugins"
output:
    calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
    stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'

The expand belongs in your rule all, or in any other target rule that uses all annotated VCFs at once. Like this:

rule all:
    input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)

The variant_annotation rule will then run for every sample you expanded in rule all.
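To see which target files rule all would request, the expansion can be imitated in plain Python. This is only a sketch: expand_like is a hypothetical imitation of expand() built on itertools.product, and the values ("varscan" and the three sample names) are assumptions taken from the paths in the question's logs.

```python
from itertools import product

def expand_like(pattern, **wildcards):
    """Hypothetical imitation of expand(): fill the pattern with every
    combination of the given wildcard values."""
    names = list(wildcards)
    value_lists = [
        v if isinstance(v, (list, tuple)) else [v] for v in wildcards.values()
    ]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]

targets = expand_like(
    "ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf",
    CALLING_TOOL="varscan",
    sample=["BC1217", "470", "MTG109"],
)
for target in targets:
    print(target)
```

Each printed path is one target; snakemake matches it against the output pattern of variant_annotation and schedules one job per sample.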

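I hope I understood your idea correctly and that this helps.

EDIT2

OK, it looks like we are almost there. The error you get is thrown by bcftools view, which suggests that something may be wrong with the VCF.

Have you tried running bcftools view on the VCF outside of the Snakefile? That would tell us whether something goes wrong during this rule or whether the VCF already has a problem.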
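A quick way to look at the file itself is to check its first line. A valid VCF starts with a ##fileformat=VCF... header line, and as far as I know a file without it is not recognised as VCF by bcftools. A minimal Python sketch: looks_like_vcf is a hypothetical helper, and the two temporary files are made-up examples, not your data.

```python
import gzip
import os
import tempfile

def looks_like_vcf(path):
    """Return True if the first line of the file is a VCF fileformat header."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline()
    return first_line.startswith("##fileformat=VCF")

# Demonstration on two made-up files: one with a VCF header, one plain table.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.vcf")
    bad = os.path.join(tmp, "bad.vcf")
    with open(good, "w") as fh:
        fh.write("##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n")
    with open(bad, "w") as fh:
        fh.write("Chrom\tPosition\tRef\tCons\n")
    results = (looks_like_vcf(good), looks_like_vcf(bad))

print(results)  # (True, False)
```

Calling looks_like_vcf on the merged file and on one of the per-contig files would show whether the header is already missing before VEP runs.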