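Short version:
I need to collect input files based on their file names without using expand(). I am trying to write a custom function based on the code below, but it does not work.
Detailed explanation:
I have the following Python code, which is very close to what I need, but it needs some fixes and adjustments to work with Snakemake: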
import glob
import pandas as pd
import os
samples = pd.read_csv('samples.tsv', sep='\t', dtype=str).set_index(["flowcell", "sample", "lane"], drop=False)
samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels]) # enforce str in index
filename = pd.Series(samples['sample']).unique()
rgs = sorted(glob.glob(os.path.join('Outputs/MergeBamAlignment/%s*.bam' % filename)))
print(rgs)
I used this TSV file:
flowcell sample library lane R1 R2
FlowCellX SAMPLE1 libZ L001 fastq/Sample1.R1.fastq.gz fastq/Sample1.R2.fastq.gz
FlowCellX SAMPLE1 libZ L002 fastq/Sample1.R1.fastq.gz fastq/Sample1.R2.fastq.gz
FlowCellX SAMPLE1 libY L003 fastq/Sample1.R1.fastq.gz fastq/Sample1.R2.fastq.gz
FlowCellY SAMPLE2 libY L003 fastq/Sample2.R1.fastq.gz fastq/Sample2.R2.fastq.gz
FlowCellX SAMPLE2 libY L003 fastq/Sample2.R1.fastq.gz fastq/Sample2.R2.fastq.gz
When I run the code, I get the following output:
['Outputs/MergeBamAlignment/SAMPLE1_L001_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE1_L002_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE1_L003_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE2_L003_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE2_L003_FlowCellY.merged.bam']
I think this is wrong; it should (probably) look like this, with one list per sample:
['Outputs/MergeBamAlignment/SAMPLE1_L001_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE1_L002_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE1_L003_FlowCellX.merged.bam']
['Outputs/MergeBamAlignment/SAMPLE2_L003_FlowCellX.merged.bam', 'Outputs/MergeBamAlignment/SAMPLE2_L003_FlowCellY.merged.bam']
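A possible explanation for the single combined list, offered as a sketch rather than a confirmed diagnosis: formatting the whole array returned by unique() into the pattern with % produces one pattern string, and the bracketed text in it behaves like a glob character class, so any file name starting with one of those characters matches. The pattern string below is an assumption about how the array is rendered. fnmatch is used here because glob applies the same matching rules to file names.

```python
import fnmatch

# Assumed rendering of '%s*.bam' % filename when filename is the whole
# array of sample names rather than a single name.
pattern = "['SAMPLE1' 'SAMPLE2']*.bam"

names = [
    "SAMPLE1_L001_FlowCellX.merged.bam",
    "SAMPLE2_L003_FlowCellY.merged.bam",
]

# The [...] part matches one character from the set, here the leading "S",
# so files from both samples match the same pattern.
print(fnmatch.filter(names, pattern))

# Formatting one sample name at a time gives a sample-specific pattern.
print(fnmatch.filter(names, "%s*.bam" % "SAMPLE1"))
```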
I can produce the correct output with this for loop:
for s in pd.Series(samples['sample']).unique():
    filename = glob.glob(os.path.join('Outputs/MergeBamAlignment/%s*.bam' % s))
    print(filename)
However, as far as I understand, I cannot use a for loop like this inside the function, so this does not seem to be an option.
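So how can I generate a list (?) of input files for each sample with a custom function instead of expand()? expand() is not an option because the answer to my previous question breaks an earlier rule: {{sample}} seems to confuse zip in that rule's expand() call.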
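For reference, a minimal sketch of what such a function could look like. It builds the paths from the sample sheet instead of globbing, and takes a single wildcards argument, which is how Snakemake calls input functions. The helper name merged_bams_for_sample, the wildcard name sample, and the {sample}_{lane}_{flowcell}.merged.bam naming scheme (inferred from the output shown above) are assumptions. The sheet is inlined and SimpleNamespace stands in for the wildcards object so the example runs outside Snakemake.

```python
import csv
import io
from types import SimpleNamespace

# Inline copy of three columns of samples.tsv (tab-separated, as in the real file).
SAMPLE_SHEET = (
    "flowcell\tsample\tlane\n"
    "FlowCellX\tSAMPLE1\tL001\n"
    "FlowCellX\tSAMPLE1\tL002\n"
    "FlowCellX\tSAMPLE1\tL003\n"
    "FlowCellY\tSAMPLE2\tL003\n"
    "FlowCellX\tSAMPLE2\tL003\n"
)
ROWS = list(csv.DictReader(io.StringIO(SAMPLE_SHEET), delimiter="\t"))


def merged_bams_for_sample(wildcards):
    """Return the sorted merged BAM paths for wildcards.sample (hypothetical helper)."""
    return sorted(
        "Outputs/MergeBamAlignment/{sample}_{lane}_{flowcell}.merged.bam".format(**row)
        for row in ROWS
        if row["sample"] == wildcards.sample
    )


# Stand-in for the wildcards object that Snakemake would pass in.
print(merged_bams_for_sample(SimpleNamespace(sample="SAMPLE2")))
```

In a rule, this would presumably be referenced by name, as in input: merged_bams_for_sample, with {sample} appearing in that rule's output so the wildcard is defined.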
Any suggestions would be greatly appreciated.