Multiple inputs and outputs in a single-rule Snakemake file

Date: 2017-06-15 07:23:58

Tags: python-3.x snakemake

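I am getting started with Snakemake and have a very basic question that I could not find answered in the Snakemake tutorial.

I want to create a single-rule Snakefile that downloads multiple files one by one on Linux. 'expand' cannot be used in the output, because the files need to be downloaded one at a time, and wildcards cannot be used because it is the target rule.

The only approach I could think of is something like the following, which does not work properly. I cannot figure out how to send the downloaded items to a specific directory with specific names, such as 'downloaded_files.dwn', using {output}, so that they can be used in later steps: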

links=[link1,link2,link3,....]
rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString=''
        for item in links:
            callString+='wget str(item) -O '+{output}+'\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)

I would appreciate any hints on how to solve this, and on which parts of Snakemake I have not understood well.

1 answer:

Answer 0: (score: 5)

Here is a commented example that could help you solve your problem:

# Create some way of associating output files with links
# The output file names will be built from the keys: "chain_{key}.gz"
# One could probably directly use output file names as keys 
links = {
    "1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
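
To check what this rule declares and runs without actually downloading anything, the same file names and commands can be built in plain Python. This is only a sketch: `output_files` and `download_commands` are hypothetical helper names, not part of Snakemake, and the list comprehension merely stands in for what `expand` produces in this single-keyword case.

```python
# Same association between keys and links as in the Snakefile above.
links = {
    "1": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}

# Hypothetical pattern argument, mirroring the string passed to expand().
PATTERN = "outdir/chain_{n}.gz"


def output_files(links, pattern=PATTERN):
    """Return one output path per key, in sorted key order."""
    return [pattern.format(n=key) for key in sorted(links)]


def download_commands(links, pattern=PATTERN):
    """Return the wget command lines the run block would execute, in order."""
    return [
        "wget {link} -O {path}".format(link=links[key], path=pattern.format(n=key))
        for key in sorted(links)
    ]


if __name__ == "__main__":
    for command in download_commands(links):
        print(command)
```

Note that the keys are strings, so `sorted` orders them lexicographically: a key such as "10" would come before "2".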

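Here is another way of doing it, which uses arbitrary names for the downloaded files and makes use of output (although somewhat artificially):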

links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")]


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # using a direct loop over the pairs (filename, link)
        # could be considered "cleaner"
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))
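
The index-based loop works because the output list is built from links in the same order, so position i in one matches position i in the other. A minimal sketch of that pairing in plain Python, assuming the same list of (filename, link) tuples; `pair_outputs_with_links` is a hypothetical helper and the `outputs` list only imitates what Snakemake would expose as `output`:

```python
links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"),
]


def pair_outputs_with_links(links, outdir="outdir"):
    """Pair each declared output path with the link it should be fetched from."""
    # Built exactly like the list given in the rule's output section.
    outputs = ["{d}/{f}".format(d=outdir, f=filename) for (filename, _) in links]
    # zip walks both sequences in step, replacing the explicit index i.
    return [(path, link) for path, (_, link) in zip(outputs, links)]


if __name__ == "__main__":
    for path, link in pair_outputs_with_links(links):
        print("wget {link} -O {path}".format(link=link, path=path))
```

If the output list and links were ever built in different orders, both the index loop and this pairing would silently fetch files under the wrong names, which is one argument for the direct loop over (filename, link).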

Here is an example in which the three downloads can be done in parallel, using snakemake -j 3:

# To use os.path.join,
# which is more robust than manually writing the separator.
import os

# Association between output files and source links
links = {
    "foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}


# Make this association accessible via a function of wildcards
def chainfile2link(wildcards):
    return links[wildcards.chainfile]


# First rule will drive the rest of the workflow
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())


rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """
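
To illustrate how the wildcard lookup behaves for one requested file, here is a rough stand-alone imitation in plain Python. It is a sketch under stated assumptions: `resolve_job` is a hypothetical helper, the `SimpleNamespace` object only imitates the wildcards object that Snakemake passes to `chainfile2link`, and splitting the path is a simplification of Snakemake's actual pattern matching.

```python
import os
from types import SimpleNamespace

links = {
    "foo_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}


def chainfile2link(wildcards):
    """Same function as in the Snakefile: map the wildcard value to its link."""
    return links[wildcards.chainfile]


def resolve_job(target):
    """Return the command one download job would run for a single target file."""
    # Stand-in for matching the target against the pattern outdir/{chainfile}.
    _, chainfile = os.path.split(target)
    wildcards = SimpleNamespace(chainfile=chainfile)
    return "wget {link} -O {output}".format(
        link=chainfile2link(wildcards), output=target)


if __name__ == "__main__":
    # One independent command per target, which is why the jobs can run in parallel.
    for name in links:
        print(resolve_job(os.path.join("outdir", name)))
```

Because each target resolves to its own command with no shared state, nothing forces the three downloads to run one after another.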