Using checkpoints with snakemake passes all input files to every instance of a rule

Time: 2020-07-21 16:39:28

Tags: snakemake

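I recently came across checkpoints in snakemake and realized they would fit perfectly with what I am trying to do. I have been able to implement the workflow listed here. I also found this stackoverflow question, but I cannot work out what it means or how to adapt it to my case.

The rules I am using are as follows: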

import os


def ReturnBarcodeFolderNames():
    path = config['results_folder'] + "Barcode/"
    return_direc = []
    for root, directory, files in os.walk(path):
        for direc in directory:
            return_direc.append(direc)
    return return_direc


rule all:
    input:
        expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=ReturnBarcodeFolderNames())


checkpoint barcode:
    input:
        expand(config['results_folder'] + "Basecall/{fast5_files}", fast5_files=FAST5_FILES)
    output:
        temp(directory(config['results_folder'] + "Barcode/.tempOutput/"))
    shell:
        "guppy_barcoder "
        "--input_path {input} "
        "--save_path {output} "
        "--barcode_kits EXP-PBC096 "
        "--recursive"

def aggregate_barcode_folders(wildcards):
    checkpoint_output = checkpoints.barcode.get(**wildcards).output[0]
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)

    return expand(config['results_folder'] + "Barcode/.tempOutput/{folder}", folder=folder_names)

rule merge:
    input:
        aggregate_barcode_folders
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
         "echo {input}"

rule barcode and def aggregate_barcode_folders work fine, but when the workflow reaches rule merge, every input folder is passed to every instance of the rule. The result looks like this:

rule merge:
    input: /Results/Barcode/.tempOutput/barcode81, 
/Results/Barcode/.tempOutput/barcode28, 
/Results/Barcode/.tempOutput/barcode17, 
/Results/Barcode/.tempOutput/barcode10, 
/Results/Barcode/.tempOutput/barcode26, 
/Results/Barcode/.tempOutput/barcode21, 
/Results/Barcode/.tempOutput/barcode42, 
/Results/Barcode/.tempOutput/barcode89, 
/Results/Barcode/.tempOutput/barcode45, 
/Results/Barcode/.tempOutput/barcode20, 
/Results/Barcode/.tempOutput/barcode18, 
/Results/Barcode/.tempOutput/barcode27, 
/Results/Barcode/.tempOutput/barcode11, 
.
.
.
.
.
    output: /Results/Barcode/barcode75.merged.fastq
    jobid: 82
    wildcards: folder=barcode75

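Every job of rule merge receives exactly the same input, for a total of about 80 instances. However, the wildcards section is different for each job. How can I use the wildcard as the input for each instance of my rule merge, instead of passing the whole list returned by def aggregate_barcode_folders?

I think the input of rule all may be the problem, but I am not 100% sure what is wrong with it.

Note that I know snakemake will throw an error saying it is waiting for the output files of rule merge, because I am not doing anything with the output other than printing to the screen.

Edit

I have decided not to use checkpoints for now and went with the following instead. To make things clearer, the goal of this pipeline is as follows: I am trying to merge the fastq files from the output folders into a single file per barcode, where each input folder holds a variable number of files (1 to 3 per folder, but I do not know how many in advance). The input is structured as follows.

Input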

|-- Results
    |-- FolderA
        |-- barcode01
            |-- file1.fastq
        |-- barcode02
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode03
            |-- file1.fastq
    |-- FolderB
        |-- barcode01
            |-- file1.fastq
        |-- barcode02
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode03
            |-- file1.fastq
    |-- FolderC
        |-- barcode01
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode02
            |-- file1.fastq
        |-- barcode03
            |-- file1.fastq
            |-- file2.fastq

Output

I would like the output to look like the following:

|-- Results
    |-- barcode01.merged.fastq
    |-- barcode02.merged.fastq
    |-- barcode03.merged.fastq

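Each output file would contain all of the file#.fastq files from its respective barcode folder, with data from folders A, B, and C.

(I think) I have gotten further than before, but snakemake throws an error saying Missing input files for rule basecall: /Users/joshl/PycharmProjects/ARS/Results/DataFiles/fast5/FAL03879_67a0761e_1055/ barcode72.fast5. The relevant part of my code is here:

Code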


import os
from pathlib import Path

configfile: "config.yaml"
FAST5_FILES = glob_wildcards(config['results_folder'] + "DataFiles/fast5/{fast5_files}.fast5").fast5_files

def return_fast5_folder_names():
    path = config['results_folder'] + "Basecall/"
    fast5_folder_names = []
    for item in os.scandir(path):
        if Path(item).is_dir():
            fast5_folder_names.append(item.name)

    return fast5_folder_names

def return_barcode_folder_names():
    path = config['results_folder'] + ".barcodeTempOutput"
    fast5_folder_names = []
    collated_barcode_folder_names = []

    for item in os.scandir(path):
        if Path(item).is_dir():
            full_item_path = os.path.join(path, item.name)
            fast5_folder_names.append(full_item_path)

    index = 0
    for item in fast5_folder_names:
        collated_barcode_folder_names.append([])
        for folder in os.scandir(item):
            if Path(folder).is_dir():
                collated_barcode_folder_names[index].append(folder.name)
        index += 1

    return collated_barcode_folder_names


rule all:
    input:
        # basecall
        expand(config['results_folder'] + "Basecall/{fast5_file}", fast5_file=FAST5_FILES),

         # barcode
        expand(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}", fast5_folders=return_fast5_folder_names()),

        # merge files
        expand(config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq", barcode_numbers=return_barcode_folder_names())

rule basecall:
    input:
         config['results_folder'] + "DataFiles/fast5/{fast5_file}.fast5"
    output:
        directory(config['results_folder'] + "Basecall/{fast5_file}")
    shell:
         r"""
         guppy_basecaller \
         --input_path {input} \
         --save_path {output} \
         --quiet \
         --config dna_r9.4.1_450bps_fast.cfg \
         --num_callers 2 \
         --cpu_threads_per_caller 6
         """

rule barcode:
    input:
        config['results_folder'] + "Basecall/{fast5_folders}"
    output:
        directory(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}")
    threads: 12
    shell:
         r"""
         for item in {input}; do
                guppy_barcoder \
                --input_path $item \
                --save_path {output} \
                --barcode_kits EXP-PBC096 \
                --recursive
         done         
         """

rule merge_files:
    input:
        expand(config['results_folder'] + ".barcodeTempOutput/" + "{fast5_folder}/{barcode_numbers}",
               fast5_folder=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").fast5_folders,
               barcode_numbers=glob_wildcards(config['results_folder'] +".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").barcode_numbers)
    output:
        config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq"
    shell:
        r"""
        echo "Hello world"
        echo {input}
        """

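Under rule all, if I comment out the line corresponding to merging the files, there is no error.

1 Answer:

Answer 0: (score: 1)

I did not fully understand what you mean, but I think the problem is indeed the input of rule all. I also do not have access to a computer right now (I am on my phone), so I cannot give a real example. Probably what you want is to change ReturnBarcodeFolderNames to use the checkpoint. I guess that only after rule barcode has run do you actually know what you want as the final output.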

def ReturnBarcodeFolderNames(wildcards):
    # calling checkpoints.barcode.get() here makes sure that barcode is executed first
    checkpoint_output = checkpoints.barcode.get().output[0]
    
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)

    return expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=folder_names)


rule all:
    input:
        ReturnBarcodeFolderNames


rule merge:
    input:
        config['results_folder'] + "Barcode/.tempOutput/{folder}"
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
         "echo {input}"

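Obviously, ReturnBarcodeFolderNames will not run in its current form. However, the idea is that the desired final output is determined in rule all only after rule barcode has been executed. Rule merge then does not need to use the checkpoint, because its input and output can be defined unambiguously from the {folder} wildcard.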
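The folder-listing part of that idea can be sketched in plain Python, outside of Snakemake. This is only a minimal sketch: list_subdirs and merged_targets are hypothetical helper names, and it assumes the barcode folders sit directly under the checkpoint's output directory. It uses os.scandir instead of os.walk so that only top-level folders become targets, since os.walk also descends into nested directories.

```python
import os


def list_subdirs(path):
    """Return the names of the immediate subdirectories of `path`, sorted."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


def merged_targets(results_folder, checkpoint_dir):
    """Build one '<folder>.merged.fastq' target per subdirectory of `checkpoint_dir`."""
    return [
        os.path.join(results_folder, "Barcode", name + ".merged.fastq")
        for name in list_subdirs(checkpoint_dir)
    ]
```

Inside the input function for rule all, checkpoint_dir would be the value of checkpoints.barcode.get(**wildcards).output[0], so the list of targets is only built once the checkpoint has finished.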

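I hope this helps :), but maybe I have been solving a different problem from yours. Unfortunately, the question is still not entirely clear to me.


Edit

This is a stripped-down version of the code, but it should now be easy to implement the last part. It works for the folder structure you gave in your example: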

import os
import glob


def get_merged_barcodes(wildcards):
    tmpdir = checkpoints.barcode.get(**wildcards).output[0]  # this forces the checkpoint to be executed before we continue
    barcodes = set()  # a set is like a list, but only stores unique values
    for folder in os.listdir(tmpdir):
        for barcode in os.listdir(tmpdir + "/" + folder):
            barcodes.add(barcode)

    mergedfiles = ["results/" + barcode + ".merged.fastq" for barcode in barcodes]
    return mergedfiles
    

rule all:
    input:
        get_merged_barcodes


checkpoint barcode:
    input:
        rules.basecall.output
    output:
        directory("results")
    shell:
        """
        stuff
        """


def get_merged_input(wildcards):
    return glob.glob(f"results/**/{wildcards.barcode}/*.fastq")



rule merge_files:
    input:
        get_merged_input
    output:
        "results/{barcode}.merged.fastq"
    shell:
        """
        echo {input}
        """

Basically, you were almost there with what you did in your original question!
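The remaining step, merging the files for one barcode, could look roughly like the sketch below. It is an illustration under assumptions rather than a tested part of the workflow: fastq_files_for and merge_fastq are hypothetical names, the layout is assumed to be results/<run folder>/<barcode>/*.fastq as in the example tree, and plain concatenation is assumed to be an acceptable way to merge fastq files. Note that a `*` in a glob pattern does not match directories whose names start with a dot.

```python
import glob
import os
import shutil


def fastq_files_for(results_dir, barcode):
    """Return the sorted fastq paths for one barcode across all run folders."""
    pattern = os.path.join(results_dir, "*", barcode, "*.fastq")
    return sorted(glob.glob(pattern))


def merge_fastq(inputs, output):
    """Concatenate the input files, in order, into a single output file."""
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
```

In the workflow itself, the same effect can be had by replacing the echo in rule merge_files with a shell command such as cat {input} > {output}, or by calling a function like merge_fastq from a run: block.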