Question

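My current architecture is that at the start of my Snakefile I have a long-running function, somefunc, which helps determine the "input" to rule all. I realized that when I run the workflow with Slurm, somefunc is executed by every job. Is there some variable I can access that tells me whether the code is running in a submitted job or in the main process? Something like: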

if not snakemake.submitted_job:
    config['layout'] = somefunc()

...

Answer 1

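One solution, which I don't really recommend, is to have somefunc write the list of inputs to a tmp file, so that the Slurm jobs read this tmp file rather than rebuilding the list from scratch. The tmp file is created by whichever process runs first, so the long-running part is executed only once.

At the end of the workflow, delete the tmp file so that later executions start fresh with new input.

Here is a sketch: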

import os

def somefunc():
    try:
        # Reuse the list cached by whichever process ran first
        with open('tmp.txt') as fin:
            all_output = [x.strip() for x in fin]
        print('List of input files read from tmp.txt')
    except FileNotFoundError:
        # No cache yet: do the expensive work once and store the result
        all_output = ['file1.txt', 'file2.txt'] # Long running part
        with open('tmp.txt', 'w') as fout:
            for x in all_output:
                fout.write(x + '\n')
        print('List of input files created and written to tmp.txt')
    return all_output

all_output = somefunc()

rule all:
    input:
        all_output,

rule one:
    output:
        all_output,
    shell:
        r"""
        touch {output}
        """

onsuccess:
    os.remove('tmp.txt')
onerror:
    os.remove('tmp.txt')

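Since jobs are submitted in parallel, you should make sure that only one process writes tmp.txt and the others only read it. I think the try/except above does that, but I'm not 100% sure. (You probably also want a better filename than tmp.txt; see the tempfile module. See also the atexit module for exit handlers.)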
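One way to reduce the risk of a job reading a half-written tmp.txt is to write the list to a temporary file first and then rename it into place. The following is only a minimal sketch under that assumption: the helper names write_list_atomically and read_list are hypothetical and not part of Snakemake, and it relies on os.replace being an atomic rename when source and destination are on the same filesystem.

```python
import os
import tempfile


def write_list_atomically(items, path):
    """Write one item per line to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fout:
            for item in items:
                fout.write(item + "\n")
        # Readers see either no file or the complete file, never a partial one
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_list(path):
    """Read the list back, one stripped item per line."""
    with open(path) as fin:
        return [line.strip() for line in fin]
```

In the sketch above, somefunc could call write_list_atomically(all_output, 'tmp.txt') in its except branch instead of writing the file directly. This does not stop two processes from both doing the long-running work, it only keeps readers from seeing an incomplete file.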

Answer 2

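As discussed with @dariober, checking whether the (hidden) .snakemake directory contains locks seems to be the cleanest approach, since the locks do not appear to be created until the first rule starts (assuming you are not using the --nolock argument).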

import os
locked = len(os.listdir(".snakemake/locks")) > 0
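Note that the one-liner above raises FileNotFoundError if .snakemake/locks does not exist yet, for example in a fresh working directory. A slightly more defensive variant might look like the sketch below. The function name has_locks and its workdir parameter are hypothetical, and the directory layout is taken from the snippet above rather than from any documented Snakemake guarantee.

```python
import os


def has_locks(workdir="."):
    """Return True if <workdir>/.snakemake/locks exists and is not empty."""
    lock_dir = os.path.join(workdir, ".snakemake", "locks")
    # A missing lock directory is treated the same as "no locks"
    return os.path.isdir(lock_dir) and len(os.listdir(lock_dir)) > 0
```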

However, this causes problems in my case:

import time
import os


def longfunc():
    time.sleep(10)
    return range(5)

locked = len(os.listdir(".snakemake/locks")) > 0
if not locked:
    info = longfunc()


rule all:
    input:
        expand("test_{sample}", sample=info)



rule test:
    output:
        touch("test_{sample}")
    shell:
        """
        sleep 1
        """

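Somehow Snakemake has every job re-interpret the complete Snakefile. Inside a job the locks already exist, so longfunc is skipped, info is never assigned, and all jobs complain that "info is not defined". For me the easiest way turned out to be storing the result and loading it in each job (pickle.dump and pickle.load).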
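The pickle approach could be sketched roughly as follows. This is an illustration, not the exact code I ended up with: the helper name load_or_compute and the cache filename info.pkl in the usage note are made up for the example.

```python
import pickle


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and storing it if absent."""
    try:
        with open(cache_path, "rb") as fin:
            return pickle.load(fin)
    except FileNotFoundError:
        # First run (main process): do the expensive work and cache it
        result = compute()
        with open(cache_path, "wb") as fout:
            pickle.dump(result, fout)
        return result
```

In the Snakefile this would replace the lock check, e.g. info = load_or_compute("info.pkl", longfunc), so that info is defined both in the main process and in every job. As with the tmp file in Answer 1, the cache file has to be deleted at the end of the workflow if later runs should recompute it.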

Snakemake: a variable that tells whether the process is a submitted cluster job or the main Snakefile process