Use of multiple parameters in snakemake

Date: 2017-01-20 16:16:27

Tags: python-3.x, snakemake

I just started using snakemake and was wondering what the "correct" way is to run a set of parameters on the same file, and how this can be used for chaining rules?

So for example, when I want multiple normalization methods, followed by, say, a clustering rule with varying numbers of k clusters: what is the best way to do this so that all combinations are run?

I started doing this:

INFILES = ["mytable"]

rule preprocess:
    input:
        bam=expand("data/{sample}.csv", sample=INFILES, param=config["normmethod"])
    output:
        bamo=expand("results/{sample}_pp_{param}.csv", sample=INFILES, param=config["normmethod"])
    script:
        "scripts/preprocess.py"

and then invoked the script via:

snakemake --config normmethod=Median

However, this doesn't really scale to more options later in the workflow. For example, how would I automatically combine these options?

normmethods = ["Median", "Quantile"]
kclusters = [1, 3, 5, 7, 10]
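
That is, for a single input table I would expect 2 x 5 = 10 result files, along the lines of:

results/mytable_pp_Median-1.csv
results/mytable_pp_Median-3.csv
...
results/mytable_pp_Quantile-10.csv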

3 Answers

Answer 0 (score: 5)

It seems you did not pass the params to your script. How about something like below?

import re
import os
import glob

normmethods = ["Median", "Quantile"]  # can be set from config['normmethods']
kclusters = [1, 3, 5, 7, 10]          # can be set from config['kclusters']

INFILES = ['results/' + re.sub(r'\.csv$', '_pp_' + m + '-' + str(k) + '.csv',
                               re.sub('data/', '', file))
           for file in glob.glob("data/*.csv")
           for m in normmethods
           for k in kclusters]

rule cluster:
    input: INFILES

rule preprocess:
    input:
        bam="data/{sample}.csv"
    output:
        bamo="results/{sample}_pp_{m}-{k}.csv"
    run:
        os.system("scripts/preprocess.py %s %s %s %s" %
                  (input.bam, output.bamo, wildcards.m, wildcards.k))
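
For this to work, scripts/preprocess.py has to read those four positional arguments from the command line (and be executable). The original script is not shown, so this is only a minimal sketch with hypothetical contents:

#!/usr/bin/env python
# scripts/preprocess.py -- hypothetical sketch of the script invoked above.
# Positional arguments: input CSV, output CSV, normalization method, k.
import sys

def main():
    in_csv, out_csv, method, k = sys.argv[1], sys.argv[2], sys.argv[3], int(sys.argv[4])
    # Placeholder for the real normalization/clustering logic:
    # copy the table and record which parameters were applied.
    with open(in_csv) as fin, open(out_csv, 'w') as fout:
        fout.write("# method=%s, k=%d\n" % (method, k))
        fout.write(fin.read())

if __name__ == '__main__':
    main()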

Answer 1 (score: 5)

You did well using the expand() function in your rule.

For the params, I suggest using a configuration file containing all your parameters. Snakemake works with YAML and JSON files; see the Snakemake documentation for details on both formats.

In your case, you just have to write this in a YAML file:

INFILES: ["mytable"]

normmethods: ["Median", "Quantile"]
or
normmethods:
  - "Median"
  - "Quantile"

kclusters: [1, 3, 5, 7, 10]
or
kclusters:
  - 1
  - 3
  - 5
  - 7
  - 10

And write your rule like this:

rule preprocess:
    input:
        bam = expand("data/{sample}.csv",
                     sample = config["INFILES"])
    params:
        kcluster = config["kclusters"]
    output:
        bamo = expand("results/{sample}_pp_{method}_{cluster}.csv",
                      sample = config["INFILES"],
                      method = config["normmethods"],
                      cluster = config["kclusters"])
    script:
        "scripts/preprocess.py"

Then you just have to launch it like this:

snakemake --configfile path/to/config.yml

To run with other parameters, you only have to modify your config file, not your snakefile (fewer mistakes), and it is better for readability and code beauty.
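
Note that the script: directive takes only the path to the script; unlike shell:, it does not interpolate {input} or {params} placeholders. Instead, Snakemake injects a snakemake object into the Python script. A minimal sketch of how scripts/preprocess.py could pick up the values under that convention (hypothetical; the actual script is not shown):

# scripts/preprocess.py -- hypothetical sketch for use with the script: directive.
# Snakemake injects a `snakemake` object here; nothing needs to be imported.
in_files = snakemake.input.bam         # all input CSVs from the rule
out_files = snakemake.output.bamo      # all expected output CSVs
kclusters = snakemake.params.kcluster  # the list of k values from the config

for out_path in out_files:
    # Placeholder for the real preprocessing logic.
    with open(out_path, 'w') as fh:
        fh.write("# k values tried: %s\n" % (kclusters,))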

Edit:

rule preprocess:
    input:
        bam = "data/{sample}.csv"

Just to correct my own mistake: you do not need to use expand on the input here, since you want to run the rule on one .csv file at a time. So just put the wildcard here and Snakemake will do the rest.
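
Carried through, the per-file version also needs the parameters as wildcards in the output, plus a target rule that requests every combination; a minimal sketch under that assumption (config keys as in the YAML above):

rule all:
    input:
        expand("results/{sample}_pp_{method}_{cluster}.csv",
               sample = config["INFILES"],
               method = config["normmethods"],
               cluster = config["kclusters"])

rule preprocess:
    input:
        bam = "data/{sample}.csv"
    output:
        bamo = "results/{sample}_pp_{method}_{cluster}.csv"
    script:
        "scripts/preprocess.py"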

Answer 2 (score: 1)

This answer is similar to @Shiping's answer, which is to use wildcards in the output of a rule to implement multiple parameters per input file. However, this answer provides a more detailed example and avoids using complex list comprehensions, regular expressions, or the glob module.

@Pereira Hugo's approach uses one job to run all parameter combinations for one input file, whereas the approach in this answer uses one job to run one parameter combination for one input file, which makes it easier to parallelize the execution of each parameter combination on one input file.
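
Because each parameter combination is its own job here, Snakemake can run the combinations concurrently when given more cores, e.g.:

snakemake --cores 4 -p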

Snakefile:

import os

data_dir = 'data'
sample_fns = os.listdir(data_dir)
sample_pfxes = list(map(lambda p: p[:p.rfind('.')], sample_fns))

res_dir = 'results'

params1 = [1, 2]
params2 = ['a', 'b', 'c']

rule all:
    input:
        expand(os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv'),
               sample=sample_pfxes, param1=params1, param2=params2)

rule preprocess:
    input:
        csv=os.path.join(data_dir, '{sample}.csv')
    output:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')
    shell:
        "ls {input.csv} && \
         echo P1: {wildcards.param1}, P2: {wildcards.param2} > {output.csv}"

Directory structure before running snakemake:

$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results

Run snakemake:

$ snakemake -p
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        18      preprocess
        19

rule preprocess:
    input: data/sample_1.csv
    output: results/sample_1_p1_2_p2_a.csv
    jobid: 1
    wildcards: param2=a, sample=sample_1, param1=2

ls data/sample_1.csv && echo P1: 2, P2: a > results/sample_1_p1_2_p2_a.csv
data/sample_1.csv
Finished job 1.
1 of 19 steps (5%) done

rule preprocess:
    input: data/sample_2.csv
    output: results/sample_2_p1_2_p2_a.csv
    jobid: 2
    wildcards: param2=a, sample=sample_2, param1=2

ls data/sample_2.csv && echo P1: 2, P2: a > results/sample_2_p1_2_p2_a.csv
data/sample_2.csv
Finished job 2.
2 of 19 steps (11%) done
...
localrule all:
    input: results/sample_1_p1_1_p2_a.csv, results/sample_1_p1_2_p2_a.csv, results/sample_2_p1_1_p2_a.csv, results/sample_2_p1_2_p2_a.csv, results/sample_3_p1_1_p2_a.csv, results/sample_3_p1_2_p2_a.csv, results/sample_1_p1_1_p2_b.csv, results/sample_1_p1_2_p2_b.csv, results/sample_2_p1_1_p2_b.csv, results/sample_2_p1_2_p2_b.csv, results/sample_3_p1_1_p2_b.csv, results/sample_3_p1_2_p2_b.csv, results/sample_1_p1_1_p2_c.csv, results/sample_1_p1_2_p2_c.csv, results/sample_2_p1_1_p2_c.csv, results/sample_2_p1_2_p2_c.csv, results/sample_3_p1_1_p2_c.csv, results/sample_3_p1_2_p2_c.csv
    jobid: 0

Finished job 0.
19 of 19 steps (100%) done

Directory structure after running snakemake:

$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results
    ├── sample_1_p1_1_p2_a.csv
    ├── sample_1_p1_1_p2_b.csv
    ├── sample_1_p1_1_p2_c.csv
    ├── sample_1_p1_2_p2_a.csv
    ├── sample_1_p1_2_p2_b.csv
    ├── sample_1_p1_2_p2_c.csv
    ├── sample_2_p1_1_p2_a.csv
    ├── sample_2_p1_1_p2_b.csv
    ├── sample_2_p1_1_p2_c.csv
    ├── sample_2_p1_2_p2_a.csv
    ├── sample_2_p1_2_p2_b.csv
    ├── sample_2_p1_2_p2_c.csv
    ├── sample_3_p1_1_p2_a.csv
    ├── sample_3_p1_1_p2_b.csv
    ├── sample_3_p1_1_p2_c.csv
    ├── sample_3_p1_2_p2_a.csv
    ├── sample_3_p1_2_p2_b.csv
    └── sample_3_p1_2_p2_c.csv

Sample result:

$ cat results/sample_1_p1_2_p2_a.csv
P1: 2, P2: a