Executing intermediate rules for a checkpoint

Time: 2019-05-23 17:28:36

Tags: bioinformatics snakemake

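I'm currently having trouble getting snakemake to run the intermediate rules that a checkpoint requires. After troubleshooting, I believe the problem lies in the expand call inside the aggregate_input function, but I can't figure out why it behaves the way it does.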

This is my current Snakefile, which I modeled after the checkpoint documentation from snakemake at https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution

rule all:
    input:
        expand("string_tie_assembly/{sample}.gtf", sample=sample),
        expand("combined_fasta/{sample}.fa", sample=sample),
        "aggregated_fasta/all_fastas_combined.fa"




checkpoint clustering:
    input:
        "string_tie_assembly_merged/merged_{sample}.gtf"
    output:
        clusters = directory("split_gtf_file/{sample}")
    shell:
        """
        mkdir -p split_gtf_file/{wildcards.sample} ;
        collapse_gtf_file.py -gtf {input} -o split_gtf_file/{wildcards.sample}/{wildcards.sample}
        """

rule gtf_to_fasta:
    input:
        "split_gtf_file/{sample}/{sample}_{i}.gtf"
    output:
        "lncRNA_fasta/{sample}/canidate_{sample}_{i}.fa"
    shell:
        "gffread -w {output} -g {reference} {input}"

rule rename_fasta_files:
    input:
        "lncRNA_fasta/{sample}/canidate_{sample}_{i}.fa"
    output:
        "lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa"
    shell:
        # Wildcards must be referenced as {wildcards.<name>} inside shell commands
        "seqtk rename {input} {wildcards.sample}_{wildcards.i} > {output}"

# Gather the N output files from the GTF split
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
        sample=sample,
        i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fa")).i)
    print(x)
    return x

# Aggregate fasta from split GTF files together
rule combine_fasta_file:
    input:
        aggregate_input
    output:
        "combined_fasta/{sample}.fa"
    shell:
        "cat {input} > {output}"



# Aggregate the per-sample combined fasta files
def gather_files(wildcards):
    files = expand("combined_fasta/{sample}.fa", sample=sample)
    return files

rule aggregate_fasta_files:
    input:
        gather_files
    output:
        "aggregated_fasta/all_fastas_combined.fa"
    shell:
        "cat {input} > {output}"

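The problem I keep running into is that the combine_fasta_file rule does not run when I run the Snakefile. After spending more time troubleshooting, I realized that the aggregate_input function does not expand and returns an empty list [] instead of what I expected: a list of all the files in the directory, expanded, i.e. lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa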

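This is odd, especially since checkpoint clustering does run correctly and the downstream output files are listed in rule all.

Does anyone know why this is happening, or what might be causing it?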

Command used to run snakemake: snakemake -rs Assemble_regions.snake --configfile snake_config_files/annotated_group_config.yaml

1 answer:

Answer 0: (score: 0)

Just figured it out. The problem was that my aggregate function was targeting the wrong files. Previously I had written it as

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
        sample=sample,
        i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fa")).i)
    print(x)
    return x

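The problem with this is that it targets the wrong files. It globs for {i}.fa files, which do not exist in the checkpoint directory, instead of globbing for the .gtf files that checkpoint clustering produces. So changing the code to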

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    print(checkpoint_output)
    x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
        sample=wildcards.sample,
        i=glob_wildcards(os.path.join(checkpoint_output, "{sample}_{i}.gtf")).i)
    print(x)
    return x

solved the problem.
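
For anyone who wants to see the effect outside of snakemake, here is a rough sketch in plain Python. Note that match_wildcards and expand_paths are hypothetical helpers I wrote for illustration. They only imitate glob_wildcards and expand in a simplified way and are not snakemake's implementation. The sample name sampleA and the three .gtf files are made up as well.

```python
import itertools
import os
import re
import tempfile


def match_wildcards(directory, pattern):
    """Hypothetical, simplified stand-in for glob_wildcards."""
    # "{sample}_{i}.gtf" becomes the regex "(?P<sample>.+)_(?P<i>.+)\.gtf"
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    names = parts[1::2]
    found = {name: [] for name in names}
    for filename in sorted(os.listdir(directory)):
        match = re.fullmatch(regex, filename)
        if match:
            for name in names:
                found[name].append(match.group(name))
    return found


def expand_paths(template, **values):
    """Hypothetical, simplified stand-in for expand: one path per combination."""
    keys = list(values)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(values[key] for key in keys))
    ]


# Pretend this directory is the output of checkpoint clustering (made-up files).
with tempfile.TemporaryDirectory() as checkpoint_output:
    for i in range(3):
        open(os.path.join(checkpoint_output, f"sampleA_{i}.gtf"), "w").close()

    # Old pattern: no .fa files exist in the checkpoint directory, so nothing matches.
    old_i = match_wildcards(checkpoint_output, "{i}.fa")["i"]
    # New pattern: matches the .gtf files the checkpoint actually wrote.
    new_i = match_wildcards(checkpoint_output, "{sample}_{i}.gtf")["i"]

template = "lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa"
old_inputs = expand_paths(template, sample=["sampleA"], i=old_i)
new_inputs = expand_paths(template, sample=["sampleA"], i=new_i)

print(old_inputs)  # an empty list of i values gives an empty product
print(new_inputs)
```

With the old pattern the list of i values is empty, so the product over all wildcard values is empty too, and the input function hands back []. With the new pattern every split .gtf file contributes one i, and each i turns into one renamed fasta path.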