Question

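Inexperienced, self-taught "coder" here, so please bear with me :]

I am trying to learn Snakemake and use it to build a pipeline for my analysis. Unfortunately, I cannot get multiple instances of a single job/rule to run at the same time. My workstation is not a computing cluster, so the cluster options are not available to me. I have searched for an answer for hours, but either there is none or I do not know enough to understand it. So: is there a way to run multiple instances of a single job/rule simultaneously?

If you would like a concrete example:

Let's say I want to analyze a set of 4 .fastq files with the fastqc tool. So I enter the command:

time snakemake -j 32

which runs my code: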

SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule Raw_Fastqc:
    input:
            expand("{x}.fastq.gz", x=SAMPLES)
    output:
            expand("./{x}_fastqc.zip", x=SAMPLES),
            expand("./{x}_fastqc.html", x=SAMPLES)
    shell:
            "fastqc {input}"

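I would expect snakemake to run as many instances of fastqc as it can on 32 threads (so all 4 input files at once, easily). In reality, this command takes about 12 minutes to finish. Meanwhile, when I use GNU parallel from inside snakemake

shell:
    "parallel fastqc ::: {input}"

I get the results in 3 minutes. Clearly there is some untapped potential here.

Thanks!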

Answer 1

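If I am not mistaken, fastqc works on each fastq file separately, so your implementation does not take advantage of snakemake's parallelization: with expand() in both input and output, the rule is a single job that covers all files. Parallel execution can be achieved by defining the targets in a rule all, as shown below, and letting Raw_Fastqc handle one file per job through the wildcard {x}.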

# Collect sample names from the files on disk, e.g. "a.fastq.gz" -> "a"
SAMPLES = glob_wildcards("{x}.fastq.gz").x

# Target rule: request the fastqc outputs for every sample
rule all:
    input:
        expand("{sample_name}_fastqc.{ext}",
               sample_name=SAMPLES, ext=['zip', 'html'])

# One job per sample: {x} is resolved from each requested output file
rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "{x}_fastqc.zip",
        "{x}_fastqc.html"
    shell:
        "fastqc {input}"

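To make it clearer what rule all is requesting, here is a rough plain-Python stand-in for what expand() does with those arguments: it fills the pattern with every combination of the given values. This is only an illustration, not Snakemake's implementation; the helper name expand_targets and the sample names are made up for the example.

```python
from itertools import product


def expand_targets(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    Hypothetical helper that mimics the basic behaviour of Snakemake's
    expand(); it is not the real implementation.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Hypothetical sample names, standing in for what glob_wildcards would find
samples = ["sampleA", "sampleB"]

targets = expand_targets(
    "{sample_name}_fastqc.{ext}",
    sample_name=samples,
    ext=["zip", "html"],
)

for target in targets:
    print(target)
```

Each of these file names is matched against the output patterns of Raw_Fastqc, which fixes {x} for that job. Two samples therefore give two independent jobs (one per sample, each producing a zip and an html file), and Snakemake can run them side by side.
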
Answer 2

To add to JeeYem's answer above, you can also use the 'threads' property of each rule to define how many threads are reserved for each job, like this:

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "{x}_fastqc.zip",
        "{x}_fastqc.html"
    # Reserve 4 of the cores given via -j for each job of this rule
    threads: 4
    shell:
        "fastqc --threads {threads} {input}"

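Since fastqc itself can use multiple threads per task, you might even get an additional speed-up over the parallel implementation.

Snakemake will then automatically schedule as many jobs as fit within the total number of threads provided by the top-level call.

For example, snakemake -j 32 would execute up to 8 instances of the Raw_Fastqc rule at once (32 / 4 = 8).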
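
The arithmetic behind that number can be sketched in a few lines of Python. This is a simplified model, not Snakemake's scheduler: it assumes only one rule is competing for cores, ignores other resources such as memory, and the function name max_concurrent_jobs is made up for the example.

```python
def max_concurrent_jobs(cores, threads_per_job, pending_jobs):
    """Estimate how many jobs of one rule can run at the same time.

    Simplified model (an assumption, not Snakemake's actual scheduler):
    each job reserves `threads_per_job` cores, capped at `cores`, and no
    more jobs can run than are waiting.
    """
    threads = min(threads_per_job, cores)
    return min(cores // threads, pending_jobs)


# -j 32 with threads: 4 and plenty of input files
print(max_concurrent_jobs(cores=32, threads_per_job=4, pending_jobs=100))

# The question's case: only 4 fastq files, so at most 4 jobs at once
print(max_concurrent_jobs(cores=32, threads_per_job=4, pending_jobs=4))
```

With only 4 input files, the number of files is the limiting factor rather than the 32 cores.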

Running parallel instances of a single job/rule on Snakemake