I'm new to regular expressions in Python and can't figure out how to do the following.
I have a bunch of text descriptions stored as strings, like this:
FX0XST001ALF89 OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC
FILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta
Project: SAGES SFF: FX0XST001 SFF.MID: FX0XST001.MID13
Plate: 1.1 MID_all: MID13 MID: 13 Sample: BK104
Collector: BK Year: 2008 Week: Year_Week:
Location: Ottawa_ON City: Ottawa Province: ON Crop:
Treatment: Substrate_all: Air Substrate: Air Target: Bacteria
Forward Primer: Bac16S27F Reverse Primer: Bac16S690R Taq: T
I'd like to extract the categories from this big string and store them in a database or something similar, for example:
Year: 2008
Sample: BK104
Collector: BK
etc...
How can I do this with regular expressions in Python?
I was thinking of using search:
match = re.search(r'Sample:\w\w\w\w\w', theTextDescription)
The problem is that the length of the text in each "field" varies, and I don't know how to account for that.
Answer 0 (score: 2)
Something like this: you can use \w+ to match a run of one or more word characters, whatever its length:
In [37]: strs
Out[37]: 'FX0XST001ALF89 OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC \nFILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta \nProject: SAGES SFF: FX0XST001 SFF.MID: FX0XST001.MID13 \nPlate: 1.1 MID_all: MID13 MID: 13 Sample: BK104 \nCollector: BK Year: 2008 Week: Year_Week: \nLocation: Ottawa_ON City: Ottawa Province: ON Crop: \nTreatment: Substrate_all: Air Substrate: Air Target: Bacteria \nForward Primer: Bac16S27F Reverse Primer: Bac16S690R Taq: T'
In [38]: re.findall(r"\w+:\s\w+",strs)
Out[38]:
['OLIGO: Bacillus_cand1',
'Project: SAGES',
'SFF: FX0XST001',
'MID: FX0XST001',
'Plate: 1',
'MID_all: MID13',
'MID: 13',
'Sample: BK104',
'Collector: BK',
'Year: 2008',
'Location: Ottawa_ON',
'City: Ottawa',
'Province: ON',
'Substrate_all: Air',
'Substrate: Air',
'Target: Bacteria',
'Primer: Bac16S27F',
'Primer: Bac16S690R',
'Taq: T']
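Note that \w does not match '.', which is why the output above shows 'Plate: 1' rather than 1.1 and 'MID: FX0XST001' rather than the full SFF.MID value. A minimal sketch of one way around that, assuming values never contain spaces and that a field with no value should simply be skipped. The names FIELD_RE, find_pairs and sample are hypothetical, and sample is a shortened copy of the text from the question:

```python
import re

# Key: word characters and dots (so "SFF.MID" stays whole).
# Value: a run of non-space characters, rejected if it looks like the next
# key (something ending in ':'), which is how empty fields get skipped.
FIELD_RE = re.compile(r"([\w.]+):[ \t]+(?![\w.]+:)(\S+)")

def find_pairs(text):
    """Return a list of (key, value) tuples found in text."""
    return FIELD_RE.findall(text)

sample = (
    "Project: SAGES SFF: FX0XST001 SFF.MID: FX0XST001.MID13\n"
    "Plate: 1.1 MID_all: MID13 MID: 13 Sample: BK104\n"
    "Collector: BK Year: 2008 Week: Year_Week:\n"
)

pairs = find_pairs(sample)
for key, value in pairs:
    print(key, "=", value)
```

With this pattern Plate comes back as 1.1, and the empty Week and Year_Week fields are left out instead of being paired with whatever follows them. Two-word labels such as "Forward Primer" would still be reduced to "Primer", since the key part cannot contain a space.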
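Or you can store the pairs in a dictionary (each value keeps the leading space that follows the colon, because split(":") does not strip it):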
In [39]: dict(x.split(":") for x in re.findall(r"\w+:\s\w+",strs))
Out[39]:
{'City': ' Ottawa',
'Collector': ' BK',
'Location': ' Ottawa_ON',
'MID': ' 13',
'MID_all': ' MID13',
'OLIGO': ' Bacillus_cand1',
'Plate': ' 1',
'Primer': ' Bac16S690R',
'Project': ' SAGES',
'Province': ' ON',
'SFF': ' FX0XST001',
'Sample': ' BK104',
'Substrate': ' Air',
'Substrate_all': ' Air',
'Taq': ' T',
'Target': ' Bacteria',
'Year': ' 2008'}
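If the leading spaces in the values are unwanted, capturing groups are one way to avoid them: with two groups in the pattern, re.findall returns (key, value) tuples that dict() accepts directly. The sketch below also shows what happens to the repeated 'Primer' key. The names sample, pairs, fields and grouped are hypothetical, and sample is a shortened piece of the text from the question:

```python
import re
from collections import defaultdict

sample = (
    "MID_all: MID13 MID: 13 Sample: BK104\n"
    "Forward Primer: Bac16S27F Reverse Primer: Bac16S690R Taq: T"
)

# Two capturing groups make findall return (key, value) tuples, so no
# split(":") is needed and the values carry no leading space.
pairs = re.findall(r"(\w+):\s(\w+)", sample)
fields = dict(pairs)

# "Primer" appears twice; in a plain dict the later value overwrites the
# earlier one. Collecting values in lists keeps both.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(fields["Sample"])
print(grouped["Primer"])
```

Whether a plain dict or the list-valued version fits better probably depends on whether repeated labels matter for what gets stored in the database.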
Answer 1 (score: 1)
Use the quantifiers of the regular-expression language:
? = 0 or 1
* = 0 or more
+ = 1 or more
match = re.search(r'Sample:\s\w+', theTextDescription)
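To get just the value rather than the whole 'Sample: BK104' match, the same idea can be wrapped in a small helper with a capturing group. This is only a sketch: get_field and text are hypothetical names, and it assumes values contain no spaces and that a missing or empty field should give None (re.search returns None when nothing matches).

```python
import re

def get_field(name, text):
    """Return the value following 'name:' in text, or None if absent or empty."""
    pattern = (
        r"(?<![\w.])"          # do not start inside a longer label (MID vs SFF.MID)
        + re.escape(name)      # treat the label literally, e.g. the dot in SFF.MID
        + r":[ \t]+"           # colon, then spaces or tabs on the same line
        + r"(?![\w.]+:)(\S+)"  # one or more non-space chars that are not the next label
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = (
    "Project: SAGES SFF: FX0XST001 SFF.MID: FX0XST001.MID13\n"
    "Plate: 1.1 MID_all: MID13 MID: 13 Sample: BK104\n"
    "Collector: BK Year: 2008 Week: Year_Week:"
)

print(get_field("Sample", text))
print(get_field("Week", text))
```

Here \S+ is used for the value instead of \w+ so that values containing dots, such as 1.1, are returned whole.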