我正在尝试在Python 3中创建一个正则表达式,该正则表达式匹配由未知数量的字符分隔的7个字符(例如> AB0012),然后匹配另外6个字符(例如aaabbb或bbbaaa)。我的输入字符串可能如下所示:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
这是我提出的正则表达式:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
我想要产生的输出看起来像这样:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
我阅读了Python文档,但是我找不到如何匹配正则表达式的两个部分之间的未知距离。是否有某种通配符可以让我完成我的正则表达式?在此先感谢您的帮助!
编辑:
如果我在代码中使用*?
,请执行以下操作:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
我的输出如下:
[('> CD00192','aaabbb'),('','bbbaaa'),('','aaabbb')]
*列表中的第二项和第三项分别缺少> CD00192和> ZP01990。如何让正则表达式在列表中包含这些字符?
答案 0 :(得分:5)
这是一种非正则表达式方法。拆分“>” (你的数据将从第2个元素开始),然后因为你不关心这7个字符是什么,所以从第8个字符开始检查直到第14个字符。
>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
... if i[7:13] in ["aaabbb","bbbaaa"]:
... print ">" + i[:13]
...
>CD00192aaabbb
答案 1 :(得分:1)
我有一个代码也提供了职位。
以下是此代码的简单版本:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()
# Finding the result
for mat in regx.finditer(ch):
chunk,head = mat.groups()
headstart = mat.start()
dic[(headstart,head)] = [(headstart+six.start(),six.start(),six.group())
for six in rag.finditer(chunk)]
# Diplaying the result
for (headstart,head),li in dic.iteritems():
print '{:>10} {}'.format(headstart,head)
for x in li:
print '{0[0]:>10} {0[1]:>6} {0[2]}'.format(x)
结果
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
使用生成器以功能方式使用相同的代码:
import re
from itertools import imap
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
gen = ((mat.groups(),mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart,head),
[(headstart+six.start(),six.start(),six.group())
for six in rag.finditer(chunk)])
for (chunk,head),headstart in gen)
print '\n'.join('{:>10} {}'.format(headstart,head)+'\n'+\
'\n'.join(imap('{0[0]:>10} {0[1]:>6} {0[2]}'.format,li))
for (headstart,head),li in dic.iteritems())
我测量了执行的次数。
对于每个代码,我测量了字典的创建和单独显示。
使用发电机的代码(第二个)显示结果(0.020秒)比另一个(0.148秒)快7.4倍
但令我惊讶的是,使用生成器的代码比其他代码(0.000418秒)多花了47%的时间(0.000718秒)来计算字典。
另一种方法:
import re
from collections import OrderedDict
from itertools import imap
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')
def collect(ch):
li = []
dic = OrderedDict()
gen = ( (x.start(),x.group(1),x.group(2)) for x in regx.finditer(ch))
for st,g1,g2 in gen:
if g1:
if li:
dic[(stprec,g1prec)] = li
li,stprec,g1prec = [],st,g1
elif g2:
li.append((st,g2))
if li:
dic[(stprec,g1prec)] = li
return dic
dic = collect(ch)
print '\n'.join('{:>10} {}'.format(headstart,head)+'\n'+\
'\n'.join(imap('{0[0]:>10} {0[1]}'.format,li))
for (headstart,head),li in dic.iteritems())
结果
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
此代码以0.00040秒计算dic并以0.0321秒显示
要回答你的问题,你没有其他可能性,只能将每个当前值保存在'CD00192','ZP01990','SE45789'等名称之下(我不想在变量中说“) “在Python中,因为Python中没有变量。但你可以在名称”下读取“,好像我在变量”中写了“
为此,您必须使用 finditer()
以下是此解决方案的代码:
import re
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print ch,'\n'
regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')
matches = []
for mat in regx.finditer(ch):
g1,g2= mat.groups()
if g1:
head = g1
else:
matches.append((head,g2))
print matches
结果
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
我之前的代码更复杂,因为它们捕获位置并在列表中收集'CD00192','ZP01990','SE45789'等中一个标题的值'aaabbb'和'bbbaaa'。
答案 2 :(得分:0)
可以使用*
匹配零个或多个字符,因此a*
将匹配""
,"a"
,"aa"
等+
匹配一个或多个角色。
您可能也希望使用+
或*
来使量词(+?
或*?
)变得懒惰。
有关详细信息,请参阅regular-expressions.info。
答案 3 :(得分:0)
试试这个:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]