I have the following filenames and would like to sort them into groups:
Group 1:
C7_S6_L001.sorted.bam
C7_S6_L002.sorted.bam
C7_S6_L003.sorted.bam
C7_S6_L004.sorted.bam
Group 2:
CL3_S8_L001.sorted.bam
CL3_S8_L002.sorted.bam
CL3_S8_L003.sorted.bam
CL3_S8_L004.sorted.bam
Group 3:
CL5-B1_S4_L001.sorted.bam
CL5-B1_S4_L002.sorted.bam
CL5-B1_S4_L003.sorted.bam
CL5-B1_S4_L004.sorted.bam
How can I find these groups with a regular expression?
Thanks in advance.
Answer 0 (score: 1)
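Assuming the group key is everything in the filename that precedes `_L` and the three lane digits, you can capture the key with a capturing group using the following regular expression:
([A-Z0-9-_]+)_L\d{3}\.sorted\.bam
Working example: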
from collections import defaultdict
from pprint import pprint
import re

filenames = [
    "C7_S6_L001.sorted.bam",
    "C7_S6_L002.sorted.bam",
    "C7_S6_L003.sorted.bam",
    "C7_S6_L004.sorted.bam",
    "CL3_S8_L001.sorted.bam",
    "CL3_S8_L002.sorted.bam",
    "CL3_S8_L003.sorted.bam",
    "CL3_S8_L004.sorted.bam",
    "CL5-B1_S4_L001.sorted.bam",
    "CL5-B1_S4_L002.sorted.bam",
    "CL5-B1_S4_L003.sorted.bam",
    "CL5-B1_S4_L004.sorted.bam",
]

# Group 1 captures everything before "_L" plus three digits.
pattern = re.compile(r"([A-Z0-9-_]+)_L\d{3}\.sorted\.bam")

# Map each captured key to the list of filenames that share it.
grouped = defaultdict(list)
for filename in filenames:
    match = pattern.search(filename)
    if match:
        key = match.group(1)
        grouped[key].append(filename)

pprint(grouped)
It uses a defaultdict from the collections module to gather the filenames under each key, and prints:
defaultdict(<class 'list'>,
            {'C7_S6': ['C7_S6_L001.sorted.bam',
                       'C7_S6_L002.sorted.bam',
                       'C7_S6_L003.sorted.bam',
                       'C7_S6_L004.sorted.bam'],
             'CL3_S8': ['CL3_S8_L001.sorted.bam',
                        'CL3_S8_L002.sorted.bam',
                        'CL3_S8_L003.sorted.bam',
                        'CL3_S8_L004.sorted.bam'],
             'CL5-B1_S4': ['CL5-B1_S4_L001.sorted.bam',
                           'CL5-B1_S4_L002.sorted.bam',
                           'CL5-B1_S4_L003.sorted.bam',
                           'CL5-B1_S4_L004.sorted.bam']})
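One thing to watch with the pattern above: `search` accepts a match anywhere in the string, so a companion file such as a `.bam.bai` index would land in the same group. If that is not wanted, a minimal sketch using `fullmatch` might look like the following. The helper name `group_key` and the `.bai` filename are assumptions made for illustration, and the hyphen in the character class is moved to the end purely for readability (it matches the same characters).

```python
import re

# Same pattern as in the answer, with the hyphen placed last in the class.
pattern = re.compile(r"([A-Z0-9_-]+)_L\d{3}\.sorted\.bam")


def group_key(filename):
    """Return the group key, or None if the whole name does not match.

    Hypothetical helper: fullmatch requires the pattern to cover the
    entire filename, unlike search.
    """
    match = pattern.fullmatch(filename)
    return match.group(1) if match else None


print(group_key("CL5-B1_S4_L003.sorted.bam"))      # CL5-B1_S4
print(group_key("CL5-B1_S4_L003.sorted.bam.bai"))  # None
```

Whether index files should be grouped with their BAM files depends on the use case; if they should, `search` as in the answer is the simpler choice.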