我最近在checkpoints
上遇到过蛇行制造,并意识到它们可以与我正在尝试的工作完美配合。我已经能够实现工作流程listed here。我还发现了this stackoverflow question,但对它的含义或我如何使其适合我的工作一无所知
我正在使用的规则如下:
def ReturnBarcodeFolderNames():
path = config['results_folder'] + "Barcode/"
return_direc = []
for root, directory, files in os.walk(path):
for direc in directory:
return_direc.append(direc)
return return_direc
rule all:
input:
expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=ReturnBarcodeFolderNames())
checkpoint barcode:
input:
expand(config['results_folder'] + "Basecall/{fast5_files}", fast5_files=FAST5_FILES)
output:
temp(directory(config['results_folder'] + "Barcode/.tempOutput/"))
shell:
"guppy_barcoder "
"--input_path {input} "
"--save_path {output} "
"--barcode_kits EXP-PBC096 "
"--recursive"
def aggregate_barcode_folders(wildcards):
checkpoint_output = checkpoints.barcode.get(**wildcards).output[0]
folder_names = []
for root, directories, files in os.walk(checkpoint_output):
for direc in directories:
folder_names.append(direc)
return expand(config['results_folder'] + "Barcode/.tempOutput/{folder}", folder=folder_names)
rule merge:
input:
aggregate_barcode_folders
output:
config['results_folder'] + "Barcode/{folder}.merged.fastq"
shell:
"echo {input}"
rule barcode
和def aggregate_barcode_folders
可以正常工作,但是到达rule merge
时,每个输入文件夹都将传递到规则的每个实例。结果如下:
rule merge:
input: /Results/Barcode/.tempOutput/barcode81,
/Results/Barcode/.tempOutput/barcode28,
/Results/Barcode/.tempOutput/barcode17,
/Results/Barcode/.tempOutput/barcode10,
/Results/Barcode/.tempOutput/barcode26,
/Results/Barcode/.tempOutput/barcode21,
/Results/Barcode/.tempOutput/barcode42,
/Results/Barcode/.tempOutput/barcode89,
/Results/Barcode/.tempOutput/barcode45,
/Results/Barcode/.tempOutput/barcode20,
/Results/Barcode/.tempOutput/barcode18,
/Results/Barcode/.tempOutput/barcode27,
/Results/Barcode/.tempOutput/barcode11,
.
.
.
.
.
output: /Results/Barcode/barcode75.merged.fastq
jobid: 82
wildcards: folder=barcode75
rule merge
的每个作业都需要相同的确切输入,总计约80个实例。但是,每个文件夹中每个作业中的wildcards
部分是不同的。我该如何使用它作为我的rule merge
每个实例的输入,而不是传递从def aggregate_barcode_folders
收到的整个列表?
我认为rule all
的输入可能有问题,但是我不确定100%可能是什么问题。
请注意,我知道snakemake会抛出一个错误,指出它正在等待rule merge
的输出文件,因为除了将其打印到屏幕上之外,我没有对输出进行任何操作。
我已经决定暂时不使用检查点,而是选择以下内容。为了使事情更清楚,该管道的目标如下:我试图将fastq文件从输出文件夹合并为一个文件,而输入文件具有可变数量的文件(每个文件夹1到3个,但是我不知道有多少)。输入的结构如下
输入
|-- Results
|-- FolderA
|-- barcode01
|-- file1.fastq
|-- barcode02
|-- file1.fastq
|-- file2.fastq
|-- barcode03
|-- file1.fastq
|-- FolderB
|-- barcode01
|-- file1.fastq
|-- barcode02
|-- file1.fastq
|-- file2.fastq
|-- barcode03
|-- file1.fastq
|-- FolderC
|-- barcode01
|-- file1.fastq
|-- file2.fastq
|-- barcode02
|-- file1.fastq
|-- barcode03
|-- file1.fastq
|-- file2.fastq
输出 我想将输出类似于以下内容:
|-- Results
|-- barcode01.merged.fastq
|-- barcode02.merged.fastq
|-- barcode03.merged.fastq
输出文件将包含来自其各自的条形码文件夹中的所有file#.fastq
,来自文件夹A
,B
和C
的数据。
(我认为)我可以比以前走得更远,但是snakemake抛出了一个错误,提示Missing input files for rule basecall: /Users/joshl/PycharmProjects/ARS/Results/DataFiles/fast5/FAL03879_67a0761e_1055/ barcode72.fast5
。我的代码相关代码在这里:
代码
configfile: "config.yaml"
FAST5_FILES = glob_wildcards(config['results_folder'] + "DataFiles/fast5/{fast5_files}.fast5").fast5_files
def return_fast5_folder_names():
path = config['results_folder'] + "Basecall/"
fast5_folder_names = []
for item in os.scandir(path):
if Path(item).is_dir():
fast5_folder_names.append(item.name)
return fast5_folder_names
def return_barcode_folder_names():
path = config['results_folder'] + ".barcodeTempOutput"
fast5_folder_names = []
collated_barcode_folder_names = []
for item in os.scandir(path):
if Path(item).is_dir():
full_item_path = os.path.join(path, item.name)
fast5_folder_names.append(full_item_path)
index = 0
for item in fast5_folder_names:
collated_barcode_folder_names.append([])
for folder in os.scandir(item):
if Path(folder).is_dir():
collated_barcode_folder_names[index].append(folder.name)
index += 1
return collated_barcode_folder_names
rule all:
input:
# basecall
expand(config['results_folder'] + "Basecall/{fast5_file}", fast5_file=FAST5_FILES),
# barcode
expand(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}", fast5_folders=return_fast5_folder_names()),
# merge files
expand(config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq", barcode_numbers=return_barcode_folder_names())
rule basecall:
input:
config['results_folder'] + "DataFiles/fast5/{fast5_file}.fast5"
output:
directory(config['results_folder'] + "Basecall/{fast5_file}")
shell:
r"""
guppy_basecaller \
--input_path {input} \
--save_path {output} \
--quiet \
--config dna_r9.4.1_450bps_fast.cfg \
--num_callers 2 \
--cpu_threads_per_caller 6
"""
rule barcode:
input:
config['results_folder'] + "Basecall/{fast5_folders}"
output:
directory(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}")
threads: 12
shell:
r"""
for item in {input}; do
guppy_barcoder \
--input_path $item \
--save_path {output} \
--barcode_kits EXP-PBC096 \
--recursive
done
"""
rule merge_files:
input:
expand(config['results_folder'] + ".barcodeTempOutput/" + "{fast5_folder}/{barcode_numbers}",
fast5_folder=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").fast5_folders,
barcode_numbers=glob_wildcards(config['results_folder'] +".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").barcode_numbers)
output:
config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq"
shell:
r"""
echo "Hello world"
echo {input}
"""
在rule all
下,如果我注释掉对应于合并文件的行,则没有错误
答案 0 :(得分:1)
我没有完全理解您的意思,但我认为问题确实在于rule all
的输入。我目前也无法访问计算机(我现在正在使用手机),所以我无法举一个真实的例子。.可能您想做的就是更改ReturnBarcodeFolderNames
以使用检查点。我猜只有在rule barcode
之后,您才真正知道要作为最终输出。
def ReturnBarcodeFolderNames(wildcards):
# the wildcard here makes sure that barcode is executed first
checkpoint_output = checkpoints.barcode.get().output[0]
folder_names = []
for root, directories, files in os.walk(checkpoint_output):
for direc in directories:
folder_names.append(direc)
return expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=folder_names)
rule all:
input:
ReturnBarcodeFolderNames
rule merge:
input:
config['results_folder'] + "Barcode/.tempOutput/{folder}"
output:
config['results_folder'] + "Barcode/{folder}.merged.fastq"
shell:
"echo {input}"
很显然,ReturnBarcodeFolderNames
不能以当前格式运行。但是,这样做的想法是在执行rule all
之后,在rule barcode
中检查所需的最终输出。然后,规则合并不必使用检查点,因为可以清楚地定义其输入和输出。
我希望这会有所帮助:),但也许我一直在解决您的问题以外的问题。不幸的是,这个问题对我来说还不是很清楚。
修改
这是该代码的简化版本,但是现在应该很容易实现最后一部分。它适用于您在示例中给出的文件夹结构:
import os
import glob
def get_merged_barcodes(wildcards):
tmpdir = checkpoints.barcode.get(**wildcards).output[0] # this forces the checkpoint to be executed before we continue
barcodes = set() # a set is like a list, but only stores unique values
for folder in os.listdir(tmpdir):
for barcode in os.listdir(tmpdir + "/" + folder):
barcodes.add(barcode)
mergedfiles = ["results/" + barcode + ".merged.fastq" for barcode in barcodes]
return mergedfiles
rule all:
input:
get_merged_barcodes
checkpoint barcode:
input:
rules.basecall.output
output:
directory("results")
shell:
"""
stuff
"""
def get_merged_input(wildcards):
return glob.glob(f"results/**/{wildcards.barcode}/*.fastq")
rule merge_files:
input:
get_merged_input
output:
"results/{barcode}.merged.fastq"
shell:
"""
echo {input}
"""
基本上,您在原始问题中所做的工作都差不多了!