Question

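I have CRAM (BAM) files that I want to split by read group. This requires reading the header and extracting the read group IDs.

I have this function in my Snakefile that does this: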

def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)

I have this rule all:

rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))

And this rule, which does the actual splitting:

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
    output:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    run:
        import subprocess
        read_groups = open(input.readGroupIDs).readlines()
        read_groups = [str(rg.replace('\n','')) for rg in read_groups]
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)

I get this error on a dry run:

[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
  File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups

{sample} is not being passed to the function.

How do I fix this? If I am not doing this the 'snakemake-ic' way, I am open to other approaches.

==============

Edit 1

OK, the first set of examples I gave had a lot of problems.

Here is a better(?) set of code that I hope demonstrates my problem.

import sys
from os.path import join

shell.prefix("set -eo pipefail; ")

def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)

rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))

rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read groups values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'


rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' #not working

==============

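Edit 2

Getting closer, but I am currently struggling to get expand to build the lane BAM file names correctly while keeping the wildcards.

I am using this loop to create the intermediate file names: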

for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))

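return_ID is a function that takes the sample wildcard and returns the list of read groups that the sample contains.

If I use out_rg_bam as the input to the merge rule, then ALL of the files are merged into one BAM instead of being split by sample.

If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)), then the rg_id values are applied to every sample. So if I have two samples (a, b) with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.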
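
The over-expansion described above is what happens when every sample is combined with every read group ID. A minimal plain-Python sketch of that behaviour, next to a per-sample alternative, might look like the following. It assumes a hypothetical READ_GROUPS dictionary standing in for the output of return_ID, and lane_bams is an illustrative name, not part of Snakemake.

```python
from itertools import product
from types import SimpleNamespace

# Hypothetical read groups per sample, standing in for return_ID(sample).
READ_GROUPS = {"a": ["0", "1"], "b": ["0", "1", "2"]}

PATTERN = "temp/lane_bam/{sample}.ID{rg_id}.bam"

# Cross product: every sample is paired with every read group ID seen anywhere.
all_ids = sorted({rg for rgs in READ_GROUPS.values() for rg in rgs})
cross = [PATTERN.format(sample=s, rg_id=r) for s, r in product(READ_GROUPS, all_ids)]

def lane_bams(wildcards):
    """Return only the lane BAMs that belong to wildcards.sample."""
    return [PATTERN.format(sample=wildcards.sample, rg_id=rg)
            for rg in READ_GROUPS[wildcards.sample]]

# SimpleNamespace stands in for the wildcards object of an input function.
per_sample = lane_bams(SimpleNamespace(sample="a"))
print(cross)
print(per_sample)
```

In a Snakefile, a function like lane_bams could be given as the input of the merge rule, so that each sample only merges its own lane files.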

Answer 1

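I will give a more general answer to help others who may find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are listed directly, e.g.: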

input:
    '{sample}.bam'

If you try to use a function the way you did:

input:
    read_groups=identify_read_groups('cram/{sample}.bam.cram')

no wildcard substitution is performed. You can use a lambda function and do the substitution yourself:

input:
    read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))
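
The difference between the two forms comes down to when the function is called. The sketch below mimics this in plain Python, without Snakemake or samtools. Note the assumptions: fake_identify_read_groups is a hypothetical stand-in that only reports the path it receives, 'sampleA' is a made-up sample name, and SimpleNamespace plays the role of the wildcards object.

```python
from types import SimpleNamespace

def fake_identify_read_groups(cram_file):
    # Hypothetical stand-in: report which path the function was asked to open.
    return "would open " + cram_file

# Called directly while the file is parsed: the braces are still literal text.
eager = fake_identify_read_groups('cram/{sample}.bam.cram')

# Wrapped in a lambda: nothing runs until a wildcards object is supplied.
deferred = lambda wildcards: fake_identify_read_groups(
    'cram/{sample}.bam.cram'.format(sample=wildcards.sample))

print(eager)
print(deferred(SimpleNamespace(sample='sampleA')))
```

The first call sees the unsubstituted path, which matches the samtools error in the question; the second sees a real file name.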

Answer 2

Your rule all cannot contain wildcards. It is a no-wildcard zone.

Edit 1

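I typed this in Notepad++. It has not been run and is only meant to provide a framework. I think this is closer to what you are after.

Use a function inside expand to generate the list of file names, and let that list in rule all drive the rest of the Snakemake pipeline. The basePrefix and baseSuffix variables are only there to show that string arguments can be passed in. The function has to return the list it builds; expand then gives rule all a plain list of strings, which it accepts as input directly.

import subprocess

def getSampleFileList(basePrefix, baseSuffix):
    myFileList = []
    # the wildcard glob call, as used in the question
    ListOfSamples, = glob_wildcards('cram/{sample}.bam.cram')
    for sample in ListOfSamples:
        # same samtools call the question uses to list the read group IDs
        command = "samtools view -H cram/" + sample + ".bam.cram | grep ^@RG | cut -f2 | cut -f2 -d':'"
        for rg in subprocess.check_output(command, shell=True).decode().split():
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
    return myFileList


basePrefix = "cram/RG_bams/"
baseSuffix = ".bam"

rule all:
    input:
        expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix))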


rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"

END OF EDIT

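If it is not in rule all, use an inline function.

I am not sure what you are trying to accomplish, so the following notes about your code are based on my best guess.

rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))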

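The dry run fails when 'identify_read_groups' is called in the rule all input. The path is passed to the function call as a literal string, so '{sample}' is not treated as a wildcard.

Technically, if the samtools call had not failed and the function call 'identify_read_groups(cram_file)' had returned a list of 5 strings, it would expand to something like this: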

rule all:
    input:
        'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'

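However, during Snakemake's preprocessing stage the term '{sample}' is treated as a plain string, because inside an expand call a wildcard has to be written with double braces, {{}}.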
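
The double-brace convention can be tried out with Python's own str.format, which escapes braces the same way. This is only an analogy in plain Python, not a call to Snakemake's expand, and the read group IDs are made up for illustration.

```python
# str.format treats {{ and }} as literal braces, so {sample} survives the call.
template = 'cram/RG_bams/{{sample}}.RG{read_groups}.bam'
read_groups = ['rg1', 'rg2']  # hypothetical read group IDs

patterns = [template.format(read_groups=rg) for rg in read_groups]
print(patterns)

# A single-brace {sample} with no value supplied is an error instead.
try:
    'cram/RG_bams/{sample}.RG{read_groups}.bam'.format(read_groups='rg1')
    missing = None
except KeyError as err:
    missing = err.args[0]
print(missing)
```

The first result still contains '{sample}', ready to be matched as a wildcard later, while the single-brace version fails because no value for sample was given.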

See how I resolve every Snakemake variable declared in my rule all input call and use no wildcards:

expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())

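In this case, read_table returns a two-dimensional array for form. Snakemake is very well supported by Python. I needed this to pair different annotations with different variant types.

Your rule all needs strings or lists of strings as input. You cannot have wildcards in your 'all' rule. These rule all input strings are what Snakemake uses to generate matches for the wildcards in the OTHER rules. Build the whole file name in a function call and return it if you need to.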
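
To make the matching idea concrete, here is a simplified imitation in plain Python of how a concrete file name requested by rule all can be matched against another rule's output pattern. It is a sketch under assumptions, not Snakemake's actual implementation: pattern_to_regex is a hypothetical helper, each wildcard is assumed to match one or more characters, and the target file name is made up.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'dir/{name}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r'\{(\w+)\}', pattern)
    # Even positions are literal text, odd positions are wildcard names.
    regex = ''.join(
        re.escape(part) if i % 2 == 0 else '(?P<{}>.+)'.format(part)
        for i, part in enumerate(parts))
    return re.compile(regex + '$')

rule_output = 'cram/RG_bams/{sample}.RG{read_groups}.bam'
target = 'cram/RG_bams/sampleA.RG1.bam'  # hypothetical file requested by rule all

match = pattern_to_regex(rule_output).match(target)
wildcards = match.groupdict()
print(wildcards)
```

This is why rule all has to list complete file names: the wildcard values are inferred from them, not the other way round.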

I think you should turn it into something like this:

rule all:
    input:
        expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))

Also consider updating this to be more generic:

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')

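It can have two or more wildcards (one reason we love Snakemake). You can access the wildcards later in the Python 'run' directive through the wildcards object, since it looks like you want to use them in each loop. I think the input and output wildcards have to match, so maybe try it this way as well.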

rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
        rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' + str(params.rg)

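Again, I am not quite sure what you are trying to do, and I am not sure I like the idea of calling the function twice, but hey, it will run ;P Also note that the wildcard 'sample' is written inside {} in the string of the input directive and inside {{}} in the expand of the output directive.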

An example of accessing wildcards in your run directive

Example of function calls in places you wouldn't think. I grabbed VCF fields but it could have been anything. I am using an external config file here.

Answer 3

Try this: I use id = 0, 1, 2, 3 to name the output BAM files, depending on the number of read groups in the BAM file.

## this is a regular function which takes the cram file, and get the read-group to 
## construct your rule all
## you actually just need the number of @RG, below can be simplified  
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

from os.path import join

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)

outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        # use the loop index i (not the builtin id) and the same path as the rule output
        outbam.append("cram/RG_bams/{}.RG{}.bam".format(sample, i))


rule all:
    input:
        outbam

## this function only takes wildcards as its argument
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    # wildcard values are strings, so convert the index to int
    return(read_groups[int(wildcards.id)])

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram'
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'
    params:
        # a read group ID is a value, not a file, so it goes in params rather than input
        read_group=identify_read_groups
    run:
        import subprocess
        command = 'samtools view -b -r ' + str(params.read_group) + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)

When using Snakemake, think bottom-up. First define what you want to produce in rule all, then construct the rules that create those final targets.
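
The two steps of that approach can be sketched in plain Python, outside Snakemake. The read group names and sample names below are made up, RG_dict here is a hard-coded stand-in for the samtools header lookup, and read_group_for is an illustrative name.

```python
from types import SimpleNamespace

# Hypothetical read groups per sample, standing in for the samtools header lookup.
RG_dict = {'sampleA': ['rgX', 'rgY'], 'sampleB': ['rgZ']}

# Step 1: list every final target, numbering each sample's read groups 0..n-1.
targets = ['cram/RG_bams/{}.RG{}.bam'.format(sample, i)
           for sample, rgs in RG_dict.items()
           for i in range(len(rgs))]

# Step 2: map a target's wildcards back to the read group needed to build it.
def read_group_for(wildcards):
    # Wildcard values arrive as strings, so the index must be converted.
    return RG_dict[wildcards.sample][int(wildcards.id)]

print(targets)
print(read_group_for(SimpleNamespace(sample='sampleA', id='1')))
```

Numbering the outputs keeps the file names predictable, and the lookup recovers the real read group ID only at the point where samtools needs it.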

Snakemake: How to use a function that takes a wildcard and returns a value?
