Question

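My question is related to "Running parallel instances of a single job/rule on Snakemake", but I believe it is different.

I cannot create an all: rule for it in advance, because the folder of input files is created by a previous rule and depends on the user's initial data.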

Pseudocode

Rule 1: get a big file (OK)
Rule 2: split the file into multiple parts in the Split folder (OK)
Rule 3: run a program on each file created in Split

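I am now at Rule 3, with Split containing 70 files such as Split/file_001.fq Split/file_002.fq .. Split/file_069.fq

Could you help me write a rule for pigz that compresses the 70 files in parallel into 70 .gz files?

I am running snakemake -j 24 ZipSplit

config["pigt"] gives each compression job 4 threads, and I give snakemake 24 threads, so I expect 6 compressions to run in parallel. However, my current rule merges the inputs into one archive instead of parallelizing!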
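The expected concurrency follows from simple arithmetic, assuming snakemake schedules jobs so that the sum of their threads never exceeds the value passed to -j. A minimal sketch (the helper name max_parallel_jobs is hypothetical and only for illustration):

```python
def max_parallel_jobs(total_cores: int, threads_per_job: int) -> int:
    """Upper bound on how many jobs of one rule can run at the same time."""
    if total_cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads must be positive")
    # Assumption: a job never gets more threads than the cores given with -j.
    effective_threads = min(threads_per_job, total_cores)
    return total_cores // effective_threads


# -j 24 with 4 threads per pigz job
print(max_parallel_jobs(24, 4))  # prints 6
```

This bound only applies when the rule is instantiated once per file; a rule that lists all 70 files as its input is a single job.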

Should I build the input list entirely inside the rule? How?

# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell: 
      """
      pigz -k -p {threads} {input}
      """

I tried to define the input directly with

input: glob_wildcards("Split/{x}.fq")

but that raises a syntax error.

# InSilico_PCR Snakefile

import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

# source config variables
configfile: "config.yaml"


# single job
rule GetRawData:
    input:
      HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
      os.path.join("RawData", config["infile"])
    run:
      shell("cp {input} {output}")


# single job
rule SplitFastq:
    input:
      os.path.join("RawData", config["infile"])
    params:
      lines_per_file =  config["lines_per_file"]
    output:
      pfx = os.path.join("Split", config["infile"] + "_")
    shell:
      """
      zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
      """

# parallel job
files, = glob_wildcards("Split/{x}.fq")
rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell: 
      """
      pigz -k -p {threads} {input}
      """

Answer 1

I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.

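However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, as in

split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/

The snakemake solution assumes your input file is named bigfile.fastq; the split and compressed output will be in the directory splitting./bigfile/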
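A minimal sketch of the checkpoint idea, not necessarily what the answerer wrote. The file and directory names (bigfile.fastq, Split, Gzipped), the rule names, the chunk size of 4000 lines and the 4 threads per job are all assumptions chosen for illustration:

```python
# Snakefile sketch. Hypothetical names: bigfile.fastq, Split/, Gzipped/.
import os


def gzipped_chunks(wildcards):
    # Accessing the checkpoint makes snakemake re-evaluate the DAG after the
    # split has run, so the chunk names are read from disk at that point.
    split_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    names = glob_wildcards(os.path.join(split_dir, "{x}.fq")).x
    return expand("Gzipped/{x}.fq.gz", x=names)


rule all:
    input:
        gzipped_chunks


# Checkpoint: the number of chunks is unknown until split has finished.
checkpoint split_fastq:
    input:
        "bigfile.fastq"
    output:
        directory("Split")
    shell:
        """
        mkdir -p {output}
        split --numeric-suffixes --additional-suffix=.fq -a 3 -l 4000 {input} {output}/file_
        """


# One job per chunk, so snakemake can run several of them concurrently.
# Output goes to a separate directory so that it is not a child of the
# checkpoint's directory output.
rule zip_chunk:
    input:
        "Split/{x}.fq"
    output:
        "Gzipped/{x}.fq.gz"
    threads: 4
    shell:
        "pigz -c -p {threads} {input} > {output}"
```

Under these assumptions, running snakemake -j 24 should allow up to six zip_chunk jobs at a time, because each job reserves 4 of the 24 cores.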
