Accepting slightly different inputs to a snakemake rule (.fq vs .fq.gz)

Date: 2017-09-06 03:57:42

Tags: bioinformatics snakemake

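I'm new to snakemake and would like to take either a pair of .fq files or a pair of .fq.gz files and run them through trim_galore to get a pair of trimmed .fq.gz output files. Without posting my whole Snakefile, I have the ugly solution below, where I simply copied the rule and changed the inputs. What would be a better solution?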

#Trim galore paired end trimming rule for unzipped fastqs:
rule trim_galore_unzipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq'),
        r2=join(config['fq_in_path'], '{sample}2.fq'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

#Trim galore paired end trimming rule for gzipped fastqs:
rule trim_galore_zipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq.gz'),
        r2=join(config['fq_in_path'], '{sample}2.fq.gz'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell: 
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

1 Answer:

Answer 0 (score: 2)

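Using an input function is probably the best solution. It consists of the following steps:

  1. Pass the wildcards to an input function.
  2. Using the known YAML values, build the theoretical file names for that sample name.
  3. Use a Python function to check which file (technically, which file suffix) actually exists.
  4. Build a list of the valid files.
  5. Return and unpack the list of valid files.

Notes:

    • The input and output should have the same wildcards; if they don't, it causes problems.
    • In the input function, make sure it cannot return an empty string, because Snakemake interprets that as the rule needing no input, which is not what you want.
    • If you adopt these suggestions, please update the rule name; I forgot to.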
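
Steps 2 to 4 and the second note can be tried in plain Python, outside Snakemake. The following is a minimal sketch, not the code from this answer: the helper name `find_fastq_pair` and the choice to prefer `.fq.gz` when both forms exist are assumptions made for illustration. Instead of returning an empty value when nothing is found, it raises.

```python
import os
import tempfile
from os.path import exists, join


def find_fastq_pair(in_dir, sample, suffixes=(".fq.gz", ".fq")):
    """Return [read1, read2] for the first suffix where both files exist.

    Hypothetical helper: the name and the suffix preference order are
    assumptions, not part of the original answer.
    """
    for suffix in suffixes:
        pair = [join(in_dir, f"{sample}.{read}{suffix}") for read in ("1", "2")]
        if all(exists(path) for path in pair):
            return pair
    # Never return an empty string or list: fail loudly instead.
    raise FileNotFoundError(
        f"no complete FASTQ pair for sample {sample!r} in {in_dir}"
    )


# Small demonstration with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("mySample1.1.fq", "mySample1.2.fq"):
    open(join(demo_dir, name), "w").close()

demo_pair = find_fastq_pair(demo_dir, "mySample1")
print([os.path.basename(p) for p in demo_pair])
# prints ['mySample1.1.fq', 'mySample1.2.fq']
```

Whether raising is preferable to the sentinel file name used in the Snakefile below is a judgment call; both avoid handing Snakemake an empty input.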

    Snakefile:

     configfile: "config.yaml"
    
     from os.path import join
     from os.path import exists
    
     rule all:
         input:
             expand("{trim_out_path}/{sample}.{readDirection}.fq.gz",
                 trim_out_path=config["trim_out_path"],
                 sample=config["sampleList"],
                 readDirection=['1','2'])
    
    
     def trim_galore_input_determination(wildcards):
         potential_file_path_list = []
         # Cycle through both suffix possibilities:
         for fastqSuffix in [".fq", ".fq.gz"]:
    
             # Cycle through both read directions
             for readDirection in ['.1','.2']:
    
                 # Build the candidate path for each suffix
                 potential_file_path = config["fq_in_path"] + "/" + wildcards.sample + readDirection + fastqSuffix
    
                 #Check if this file actually exists
                 if exists(potential_file_path):
    
                     #If file is legit, add to list of acceptable files
                     potential_file_path_list.append(potential_file_path)
    
         # Checking for an empty list
         if len(potential_file_path_list):
             return potential_file_path_list
         else:
             return ["trim_galore_input_determination_FAILURE" + wildcards.sample]
    
     rule trim_galore_unzipped_PE:
         input:
             unpack(trim_galore_input_determination)
         output:
             expand("{trim_out_path}/{{sample}}.{readDirection}.fq.gz",
                 trim_out_path=config["trim_out_path"],
                 readDirection=['1','2'])
         params:
             out_path=config['trim_out_path'],
         conda:
             'envs/biotools.yaml',
         shell:
             'trim_galore --gzip -o {params.out_path} --paired {input}'
    

    config.yaml:

    fq_in_path: input/fq
    trim_out_path: output
    sampleList: ["mySample1", "mySample2"]
    

    $ tree

    |-- [tboyarsk      1540 Sep  6 15:17]  Snakefile
    |-- [tboyarsk        82 Sep  6 15:17]  config.yaml
    |-- [tboyarsk       512 Sep  6  8:55]  input
    |   |-- [tboyarsk       512 Sep  6  8:33]  fq
    |   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq
    |   |   |-- [tboyarsk         0 Sep  6  8:24]  mySample1.2.fq
    |   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample2.1.fq
    |   |   `-- [tboyarsk         0 Sep  6  8:24]  mySample2.2.fq
    |   `-- [tboyarsk       512 Sep  6  8:55]  fqgz
    |       |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq.gz
    |       |-- [tboyarsk         0 Sep  6  8:32]  mySample1.2.fq.gz
    |       |-- [tboyarsk         0 Sep  6  8:33]  mySample2.1.fq.gz
    |       `-- [tboyarsk         0 Sep  6  8:32]  mySample2.2.fq.gz
    `-- [tboyarsk       512 Sep  6  7:55]  output
    

    $ snakemake --dryrun (input: fq)

     rule trim_galore_unzipped_PE:
         input: input/fq/mySample1.1.fq, input/fq/mySample1.2.fq
         output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
         jobid: 1
         wildcards: sample=mySample1
    
    
     rule trim_galore_unzipped_PE:
         input: input/fq/mySample2.1.fq, input/fq/mySample2.2.fq
         output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
         jobid: 2
         wildcards: sample=mySample2
    
    
     localrule all:
         input: output/mySample1.1.fq.gz, output/mySample2.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.2.fq.gz
         jobid: 0
    
     Job counts:
             count   jobs
             1       all
             2       trim_galore_unzipped_PE
             3
    

    $ snakemake --dryrun (input: fqgz)

     rule trim_galore_unzipped_PE:
         input: input/fqgz/mySample1.1.fq.gz, input/fqgz/mySample1.2.fq.gz
         output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
         jobid: 1
         wildcards: sample=mySample1
    
    
     rule trim_galore_unzipped_PE:
         input: input/fqgz/mySample2.1.fq.gz, input/fqgz/mySample2.2.fq.gz
         output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
         jobid: 2
         wildcards: sample=mySample2
    
    
     localrule all:
         input: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
         jobid: 0
    
     Job counts:
             count   jobs
             1       all
             2       trim_galore_unzipped_PE
             3
    

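    There are ways to make this more generic, but since you declare and use the YAML config to build most of the file names, I'll avoid discussing them in this answer. I'll just say that it is possible and somewhat encouraged.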
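
One possible direction for making this more generic, offered as an assumption rather than as the answerer's method, is a small factory that takes the directory, the suffix list and the read labels as parameters and returns an input function. The names `make_input_function` and `file_exists` are hypothetical, and `SimpleNamespace` only stands in for Snakemake's wildcards object so the sketch runs without Snakemake.

```python
from os.path import exists, join
from types import SimpleNamespace


def make_input_function(in_dir, suffixes, read_labels, file_exists=exists):
    """Build an input function for one directory, suffix list and label list.

    Hypothetical helper; file_exists is injectable so the logic can be
    exercised without touching the file system.
    """
    def input_function(wildcards):
        found = []
        # Same loop order as the answer: suffix outer, read label inner.
        for suffix in suffixes:
            for label in read_labels:
                path = join(in_dir, wildcards.sample + label + suffix)
                if file_exists(path):
                    found.append(path)
        if not found:
            # Same sentinel as the answer, so the result is never empty.
            return ["trim_galore_input_determination_FAILURE" + wildcards.sample]
        return found

    return input_function


# Demonstration against a pretend file listing instead of real files.
fake_files = {join("input/fq", n) for n in ("mySample1.1.fq", "mySample1.2.fq")}
pick = make_input_function(
    "input/fq", [".fq", ".fq.gz"], [".1", ".2"], file_exists=fake_files.__contains__
)
result = pick(SimpleNamespace(sample="mySample1"))
missing = pick(SimpleNamespace(sample="mySample3"))
print(result)
print(missing)
```

In a Snakefile, the returned function would be passed to `unpack()` in the same way as `trim_galore_input_determination` above, with the arguments read from the config.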

    "--paired {input}" will expand to supply both files. Because of the for loops, 1 will always come before 2.
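
The ordering claim can be checked with a short sketch in plain Python. Snakemake renders a list-valued `{input}` as its items separated by spaces; `render_command` below only imitates that with `str.join`, and both function names are hypothetical.

```python
def candidate_paths(in_dir, sample):
    """List candidate paths in the answer's loop order (suffix outer, read inner)."""
    return [
        in_dir + "/" + sample + read + suffix
        for suffix in (".fq", ".fq.gz")
        for read in (".1", ".2")
    ]


def render_command(out_path, inputs):
    """Imitate how the shell line looks once {input} is filled in."""
    return "trim_galore --gzip -o {} --paired {}".format(out_path, " ".join(inputs))


all_candidates = candidate_paths("input/fq", "mySample1")

# Pretend only the uncompressed pair exists on disk.
existing = [p for p in all_candidates if p.endswith(".fq")]
cmd = render_command("output", existing)
print(cmd)
# prints: trim_galore --gzip -o output --paired input/fq/mySample1.1.fq input/fq/mySample1.2.fq
```

One consequence of the loop order: if both the `.fq` and the `.fq.gz` pair are present in the same directory, the input function returns all four paths, which is probably not what is intended. Keeping the two forms in separate directories, as in the tree above, avoids this.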